# Using `judges` ⚖️ to Build and Leverage LLM Evaluators  


Evaluating the outputs of Large Language Models (LLMs) is often a challenging task, requiring nuanced criteria that are difficult to quantify and even harder to automate. For instance, how do we reliably assess if a model’s response is:  
- factually accurate?  
- concise yet comprehensive?  
- free of hallucinations?  
- aligned with ethical and domain-specific guidelines?  

These questions demand a human-like understanding that traditional metrics like BLEU or ROUGE often fail to capture. Crafting rule-based systems for such evaluations is equally daunting due to the subjective and complex nature of these tasks.  

✅ Enter `judges`: an open-source library that simplifies and streamlines LLM evaluations with pre-built and customizable evaluators, inspired by research-backed methods like LLM-as-a-Judge. It’s designed for a wide range of use cases, from factual correctness to hallucination detection, and offers a low-friction interface for both quick setups and advanced customizations.  

💡 The core idea is simple yet transformative: use LLMs themselves to evaluate other LLMs—efficiently, scalably, and with human-like reasoning.

🤖✓ The prompts behind these LLM judges are backed by state-of-the-art research, including influential works such as _"Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena"_ ([Zheng et al., 2023](https://arxiv.org/abs/2306.05685)) and _"Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models"_ ([Verga et al., 2024](https://arxiv.org/abs/2404.18796)).  

In this cookbook, we’ll use a subset of Google’s _Natural Questions_ dataset to demonstrate the use of `judges`. The task involves:  
1. Starting with a pre-annotated dataset where responses are labeled as "good" or "bad" based on Wikipedia content.  
2. Using AI search engines like Perplexity, Exa, and Gemini to generate responses similar to the "good" examples.  
3. Applying `judges` to evaluate these responses for aspects like correctness and hallucination.   

Through this process, you’ll see how `judges` can simplify evaluation workflows while maintaining rigor and scalability. Let’s dive in!
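
To make the LLM-as-a-judge idea concrete before touching any real API, here is a minimal sketch of the pattern. It is not the `judges` API: the prompt wording, the `Judgment` dataclass, `parse_judgment`, `judge_answer`, and the `call_llm` callable are all hypothetical names introduced for illustration, and a stub stands in for the judge model so the snippet runs offline.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical grading prompt; real judge prompts are considerably more detailed.
JUDGE_PROMPT = """You are grading an answer to a question.
Question: {question}
Reference answer: {expected}
Candidate answer: {output}
Reply with a one-sentence reasoning, then a final line 'VERDICT: True' or 'VERDICT: False'."""


@dataclass
class Judgment:
    reasoning: str
    score: bool


def parse_judgment(raw: str) -> Judgment:
    """Split the judge's reply into free-text reasoning and a boolean verdict."""
    reasoning, _, verdict = raw.rpartition("VERDICT:")
    return Judgment(reasoning=reasoning.strip(), score=verdict.strip().lower() == "true")


def judge_answer(call_llm: Callable[[str], str], question: str, output: str, expected: str) -> Judgment:
    """Ask a judge model (any callable mapping prompt text to reply text) to grade one answer."""
    prompt = JUDGE_PROMPT.format(question=question, expected=expected, output=output)
    return parse_judgment(call_llm(prompt))


def stub_llm(prompt: str) -> str:
    # Stand-in for a real model call so the example runs without network access.
    return "The candidate matches the reference.\nVERDICT: True"


judgment = judge_answer(
    stub_llm,
    question="where does the show the path take place",
    output="The Path is set in upstate New York.",
    expected="Upstate New York",
)
print(judgment.score, "-", judgment.reasoning)
```

Swapping `stub_llm` for a function that calls an actual model is the only change needed to turn this into a working, if simplistic, evaluator.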

In [59]:
if HF_API_KEY:
    !huggingface-cli login --token $HF_API_KEY
else:
    print("Hugging Face API key not found.")

The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: read).
Your token has been saved to /Users/jamesliounis/.cache/huggingface/token
Login successful


In [3]:
# !python --version
# !brew update
# !brew upgrade python



In [48]:
# !pip install judges datasets google-generativeai
#!pip install exa_py
# !pip uninstall openai --y
# #!pip install openai==1.57.1 httpx==0.27.2
# !pip install --upgrade openai

In [62]:
import pandas as pd
from dotenv import load_dotenv
import os
from IPython.display import Markdown, HTML
from tqdm import tqdm

In [60]:
from datasets import load_dataset

# Load the labeled subset of Natural Questions from the Hugging Face Hub
dataset = load_dataset("quotientai/labeled-natural-qa-random-100")

data = dataset['train'].to_pandas()

data.head()


| Unnamed: 0 | url | input_text | completion | label | feedback |
|---|---|---|---|---|---|
| 0 | https://en.wikipedia.org//w/index.php?title=De... | who did deion sanders go in the hall of fame as | ['Cornerback'] | bad | The answer is incorrect. The question is askin... |
| 1 | https://en.wikipedia.org//w/index.php?title=Un... | what is the title of the person who runs the h... | ['Speaker of the House'] | good | |
| 2 | https://en.wikipedia.org//w/index.php?title=Th... | yo la tengo theres a riot going on release date | ['March 16, 2018'] | good | |
| 3 | https://en.wikipedia.org//w/index.php?title=Th... | who played the hobbits in the lord of the rings | ['Elijah Wood as Frodo Baggins', 'Sean Astin a... | good | |
| 4 | https://en.wikipedia.org//w/index.php?title=Th... | where does the show the path take place | ['Upstate New York'] | good | |


In [61]:
# Keep only the "good" data points (.copy() avoids SettingWithCopyWarning when columns are added later)

data = data[data['label'] == 'good'].copy()

## Generating answers to our queries using AI Search engines

You can set the API keys either from a `.env` file (loaded with `load_dotenv()` and read with `os.getenv`, as below) or from Google Colab secrets, for which you can use the commented-out `userdata.get` commands.

In [66]:
# load_dotenv()

# print(f"EXA_API_KEY: {EXA_API_KEY}")
# print(f"GOOGLE_API_KEY: {GOOGLE_API_KEY}")
# print(f"OPENAI_API_KEY: {OPENAI_API_KEY}")
# print(f"PERPLEXITY_API_KEY: {PERPLEXITY_API_KEY}")
# print(f"HF_API_KEY: {HF_API_KEY}")


In [16]:
EXA_API_KEY = os.getenv('EXA_API_KEY')
GOOGLE_API_KEY = os.getenv('GOOGLE_API_KEY')
OPENAI_API_KEY = os.getenv('OPENAI_API_KEY')
PERPLEXITY_API_KEY = os.getenv('PERPLEXITY_API_KEY')
HF_API_KEY = os.getenv('HF_API_KEY')
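
Because `os.getenv` returns `None` for an unset variable, a missing key only shows up later as an authentication error from one of the services. A small check right after loading can surface this early. The helper below is a sketch: `find_missing_keys` and `REQUIRED_KEYS` are hypothetical names, not part of any library used here.

```python
import os

# The five environment variables read in the cell above.
REQUIRED_KEYS = [
    "EXA_API_KEY",
    "GOOGLE_API_KEY",
    "OPENAI_API_KEY",
    "PERPLEXITY_API_KEY",
    "HF_API_KEY",
]


def find_missing_keys(env, required=REQUIRED_KEYS):
    """Return the names of required keys that are unset or empty in `env`."""
    return [name for name in required if not env.get(name)]


missing = find_missing_keys(os.environ)
if missing:
    print(f"Missing API keys: {', '.join(missing)}")
else:
    print("All API keys found.")
```

Reporting only the names, rather than printing the values, keeps the secrets out of the notebook output.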

### Gemini

We use the Gemini API with the grounding option, following [official Google documentation](https://ai.google.dev/gemini-api/docs/grounding?lang=python).

In [23]:
# from google.colab import userdata
import google.generativeai as genai
from IPython.display import Markdown, HTML

# GOOGLE_API_KEY=userdata.get('GOOGLE_API_KEY')
genai.configure(api_key=GOOGLE_API_KEY)

We first test out the Gemini client to see if everything works as planned!

In [24]:
model = genai.GenerativeModel('models/gemini-1.5-pro-002')
response = model.generate_content(contents="What is the land area of Spain?",
                                  tools='google_search_retrieval')

In [25]:
Markdown(response.candidates[0].content.parts[0].text)

Spain's land area covers approximately 500,000 square kilometers.  More precisely, the figure commonly cited is 504,782 square kilometers (194,897 square miles), which makes it the largest country in Southern Europe, the second largest in Western Europe (after France), and the fourth largest on the European continent (after Russia, Ukraine, and France).

Including its island territories—the Balearic Islands in the Mediterranean and the Canary Islands in the Atlantic—the total area increases slightly to around 505,370 square kilometers.  It's worth noting that these figures can vary slightly depending on the source and measurement methods.  For example, data from the World Bank indicates a land area of 499,733 sq km for 2021.  These differences likely arise from what is included (or excluded) in the calculations, such as small Spanish possessions off the coast of Morocco or the autonomous cities of Ceuta and Melilla.


In [None]:
model = genai.GenerativeModel('models/gemini-1.5-pro-002')


def search_with_gemini(input_text):
    response = model.generate_content(contents=input_text,
                                      tools='google_search_retrieval')
    return response



parse_gemini_output = lambda x: x.candidates[0].content.parts[0].text


<function __main__.<lambda>(x)>

Once we have our functions ready, we run inference on our data. 

In [65]:
tqdm.pandas()

data['gemini_response'] = data['input_text'].progress_apply(search_with_gemini)

100%|██████████| 67/67 [05:04<00:00,  4.54s/it]


In [67]:
data['gemini_response_parsed'] = data['gemini_response'].apply(parse_gemini_output)
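
A run like the one above makes dozens of sequential network calls, so a single transient failure can abort the whole `progress_apply`. One common mitigation is to retry with exponential backoff. The sketch below assumes that failures surface as exceptions; `with_retries` is a hypothetical helper, and a stand-in function replaces the real search call so the example runs offline.

```python
import time


def with_retries(fn, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Wrap `fn` so that exceptions trigger a retry with exponential backoff."""
    def wrapped(*args, **kwargs):
        for attempt in range(1, max_attempts + 1):
            try:
                return fn(*args, **kwargs)
            except Exception:
                if attempt == max_attempts:
                    raise  # out of attempts: surface the last error
                sleep(base_delay * 2 ** (attempt - 1))  # waits 1s, 2s, 4s, ...
    return wrapped


# Demo with a stand-in function that fails twice before succeeding.
calls = {"n": 0}


def flaky_search(query):
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient error")
    return f"answer to: {query}"


delays = []
safe_search = with_retries(flaky_search, sleep=delays.append)  # record delays instead of sleeping
result = safe_search("What is the land area of Spain?")
print(result, delays)
```

In this notebook the wrapper could be applied as `data['input_text'].progress_apply(with_retries(search_with_gemini))`. Catching every `Exception` is a simplification; narrowing it to the client's rate-limit and timeout errors would be safer.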

### Perplexity

We get started with the Perplexity API by following [this documentation](https://www.perplexity.ai/hub/blog/introducing-pplx-api).

In [None]:
# PERPLEXITY_API_KEY=userdata.get('PERPLEXITY_API_KEY')

In [71]:
import requests


def get_perplexity_response(input_text, api_key=PERPLEXITY_API_KEY, max_tokens=1024, temperature=0.2, top_p=0.9):
    """
    Sends an input text to the Perplexity API and retrieves a response.

    Args:
        input_text (str): The user query to send to the API.
        api_key (str): The Perplexity API key for authorization.
        max_tokens (int): Maximum number of tokens for the response.
        temperature (float): Sampling temperature for randomness in responses.
        top_p (float): Nucleus sampling parameter.

    Returns:
        dict: The JSON response from the API if successful.
        str: Error message if the request fails.
    """
    url = "https://api.perplexity.ai/chat/completions"

    # Define the payload
    payload = {
        "model": "llama-3.1-sonar-small-128k-online",
        "messages": [
            {
                "role": "system",
                "content": "You are a helpful assistant. Be precise and concise."
            },
            {
                "role": "user",
                "content": input_text
            }
        ],
        "max_tokens": max_tokens,
        "temperature": temperature,
        "top_p": top_p,
        "search_domain_filter": ["perplexity.ai"],
        "return_images": False,
        "return_related_questions": False,
        "search_recency_filter": "month",
        "top_k": 0,
        "stream": False,
        "presence_penalty": 0,
        "frequency_penalty": 1
    }

    # Define the headers
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json"
    }

    # Make the API request
    response = requests.post(url, json=payload, headers=headers)

    # Check and return the response
    if response.status_code == 200:
        return response.json()  # Return the JSON response
    else:
        return f"Error: {response.status_code}, {response.text}"


In [72]:
parse_perplexity_output = lambda response: response['choices'][0]['message']['content']

In [73]:
tqdm.pandas()

data['perplexity_response'] = data['input_text'].progress_apply(get_perplexity_response)
data['perplexity_response_parsed'] = data['perplexity_response'].apply(parse_perplexity_output)

100%|██████████| 67/67 [02:12<00:00,  1.98s/it]
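
Note that `get_perplexity_response` returns an `"Error: ..."` string when the request fails, while `parse_perplexity_output` indexes its argument as a dictionary and would raise on such a value. A more tolerant parser might look like the sketch below; `parse_perplexity_output_safe` is a hypothetical name, and the two sample inputs are made up for illustration.

```python
def parse_perplexity_output_safe(response):
    """Return the answer text, or None when the request failed or the payload is malformed."""
    if not isinstance(response, dict):
        return None  # the request function returned an "Error: ..." string
    try:
        return response["choices"][0]["message"]["content"]
    except (KeyError, IndexError, TypeError):
        return None  # unexpected payload shape


# Illustrative inputs only: one well-formed payload and one failure string.
ok = {"choices": [{"message": {"content": "Upstate New York"}}]}
failed = "Error: 429, rate limited"

print(parse_perplexity_output_safe(ok))
print(parse_perplexity_output_safe(failed))
```

Returning `None` keeps failed rows in the DataFrame as missing values, so they can be counted or retried instead of crashing the `apply`.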


### Exa AI

Unlike Perplexity and Gemini, Exa doesn't offer an integrated RAG API that answers directly from search results. Instead, it provides a wrapper around the OpenAI client, described in [this documentation](https://docs.exa.ai/reference/openai).

In [74]:
# !pip install exa_py

from openai import OpenAI
from exa_py import Exa


In [83]:
import numpy as np

def get_exa_openai_response(openai_api_key=OPENAI_API_KEY, exa_key=EXA_API_KEY, model="gpt-4o-mini", input_text=None):
    """
    Generate a response using an OpenAI chat model via the Exa wrapper. Returns NaN if an error occurs.

    Args:
        openai_api_key (str): The API key for OpenAI.
        exa_key (str): The API key for Exa.
        model (str): The OpenAI model to use (e.g., "gpt-4o-mini").
        input_text (str): The input text to send to the model.

    Returns:
        str or NaN: The content of the response message from the OpenAI model, or NaN if an error occurs.
    """
    try:
        # Initialize OpenAI and Exa clients (both classes are imported in the cell above)
        openai = OpenAI(api_key=openai_api_key)
        exa = Exa(exa_key)

        # Wrap OpenAI with Exa
        exa_openai = exa.wrap(openai)

        # Generate a completion (disable tools)
        completion = exa_openai.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": input_text}],
            tools=None  # Ensure tools are not used
        )

        # Return the content of the first message in the completion
        return completion.choices[0].message.content

    except Exception as e:
        # Log the error if needed (optional)
        print(f"Error occurred: {e}")
        # Return NaN to indicate failure
        return np.nan


# Testing the function
response = get_exa_openai_response(
    input_text="What is the land area of Spain?"
)

print(response)


Wrapping OpenAI client with Exa functionality. <class 'function'>
The land area of Spain is approximately 505,370 square kilometers (195,124 square miles). This includes both the mainland and its island territories.
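
The function above builds and wraps a new client on every call, which appears to be why the "Wrapping OpenAI client" message is printed each time it runs. If the wrapped client can be reused across requests (an assumption not verified here), creating it once and caching it avoids the repeated setup. The sketch below shows the caching pattern with a stand-in class; `FakeWrappedClient` and `get_client` are hypothetical names.

```python
from functools import lru_cache


class FakeWrappedClient:
    """Stand-in for the object returned by wrapping the OpenAI client with Exa."""
    instances = 0  # counts how many clients were actually constructed

    def __init__(self, openai_key, exa_key):
        FakeWrappedClient.instances += 1
        self.openai_key = openai_key
        self.exa_key = exa_key


@lru_cache(maxsize=None)
def get_client(openai_key, exa_key):
    """Build the client once per key pair; later calls return the cached object."""
    return FakeWrappedClient(openai_key, exa_key)


for _ in range(67):  # one lookup per query, mirroring the loop over the dataset
    client = get_client("openai-key", "exa-key")

print(FakeWrappedClient.instances)
```

In the notebook, the body of `get_client` would instead return `Exa(exa_key).wrap(OpenAI(api_key=openai_key))`, the same calls used in `get_exa_openai_response`.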


In [84]:
tqdm.pandas()

data['exa_openai_response_parsed'] = data['input_text'].progress_apply(lambda x: get_exa_openai_response(input_text=x))

  0%|          | 0/67 [00:00<?, ?it/s]

Wrapping OpenAI client with Exa functionality. <class 'function'>


Error occurred: Error code: 400 - {'error': {'message': "An assistant message with 'tool_calls' must be followed by tool messages responding to each 'tool_call_id'. The following tool_call_ids did not have response messages: call_8GVRP40X0f03U67uTOZV6GJD", 'type': 'invalid_request_error', 'param': 'messages', 'code': None}}
100%|██████████| 67/67 [04:07<00:00,  3.69s/it]


In [87]:
data.drop(columns=['feedback', 'label'], inplace=True)

In [88]:
data.to_csv('../data/quotient_natural_qa_with_AI_search_completions.csv', index=False)

In [78]:
data['input_text'][22]

'when did 10 shilling note go out of circulation'
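
Since `get_exa_openai_response` returns NaN on error and `get_perplexity_response` returns an `"Error: ..."` string, it helps to know how many rows failed for each engine before judging the answers. The sketch below is one way to count them; `is_missing` and `failure_rate` are hypothetical helpers and the sample list is made up.

```python
import math


def is_missing(value):
    """True for None, NaN, or an 'Error: ...' string returned by a failed request."""
    if value is None:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    return isinstance(value, str) and value.startswith("Error:")


def failure_rate(responses):
    """Fraction of responses that are missing or failed (0.0 for an empty input)."""
    responses = list(responses)
    if not responses:
        return 0.0
    return sum(is_missing(r) for r in responses) / len(responses)


# Illustrative data: two usable answers, one NaN, one error string.
sample = ["Upstate New York", float("nan"), "Error: 400, bad request", "1989"]
print(failure_rate(sample))
```

Any iterable works as input, so a DataFrame column such as `data['exa_openai_response_parsed']` can be passed directly.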

In [89]:
df = pd.read_csv('../data/quotient_natural_qa_with_AI_search_completions.csv')
df.head()

| Unnamed: 0 | url | input_text | completion | gemini_response | gemini_response_parsed | perplexity_response | perplexity_response_parsed | exa_openai_response_parsed |
|---|---|---|---|---|---|---|---|---|
| 0 | https://en.wikipedia.org//w/index.php?title=Un... | what is the title of the person who runs the h... | ['Speaker of the House'] | response:\nGenerateContentResponse(\n done=... | The title of the person who runs the House of ... | {'id': '4acced38-a9bb-4f24-8e86-caadf25cab00',... | The title of the person who runs the House of ... | The title of the person who runs the House of ... |
| 1 | https://en.wikipedia.org//w/index.php?title=Th... | yo la tengo theres a riot going on release date | ['March 16, 2018'] | response:\nGenerateContentResponse(\n done=... | There are several albums and games with "Riot"... | {'id': '2fe35806-4b8d-45a2-9bd1-3b73f3f86faf',... | The release date for Yo La Tengo's album "Ther... | Yo La Tengo's album **"There's a Riot Going On... |
| 2 | https://en.wikipedia.org//w/index.php?title=Th... | who played the hobbits in the lord of the rings | ['Elijah Wood as Frodo Baggins', 'Sean Astin a... | response:\nGenerateContentResponse(\n done=... | The four main hobbits in *The Lord of the Ring... | {'id': '943301b5-b371-431f-ae8a-ec9a6de4812b',... | In "The Lord of the Rings: The Fellowship of t... | |
| 3 | https://en.wikipedia.org//w/index.php?title=Th... | where does the show the path take place | ['Upstate New York'] | response:\nGenerateContentResponse(\n done=... | The show *The Path* primarily takes place in t... | {'id': '214c8f5e-1669-4bc4-ac5a-c9c4ac178cf2',... | The show "The Path" takes place in a rural set... | |
| 4 | https://en.wikipedia.org//w/index.php?title=Mi... | when did michigan last win a national champion... | ['1989'] | response:\nGenerateContentResponse(\n done=... | Michigan last won a national championship in t... | {'id': '5774a4f3-a1a4-4e0b-8a96-7ad9d451e7c8',... | The 1948 Michigan Wolverines football team las... | The Michigan Wolverines last won a national ch... |
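
After the CSV round trip, every cell is read back as text, so a reference answer such as `['1989']` is a string that merely looks like a list. If the evaluators need the individual reference answers, the strings can be converted back with `ast.literal_eval`. The sketch below assumes the `completion` column holds Python-list-style strings as displayed above; `parse_completion` is a hypothetical helper.

```python
import ast


def parse_completion(cell):
    """Convert a stringified list such as "['1989']" back into a Python list."""
    if isinstance(cell, list):
        return cell  # already a list, nothing to do
    try:
        value = ast.literal_eval(cell)
    except (ValueError, SyntaxError):
        return [cell]  # not a Python literal: treat the raw value as a single answer
    if isinstance(value, (list, tuple)):
        return list(value)
    return [str(value)]  # a bare literal such as 1989 becomes one string answer


print(parse_completion("['March 16, 2018']"))
print(parse_completion("Upstate New York"))
```

Applied to the DataFrame, this would be `df['completion'].apply(parse_completion)`. `ast.literal_eval` only evaluates literals, which makes it a safer choice than `eval` for data loaded from disk.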
