# Using `judges` ⚖️ to Build and Leverage LLM Evaluators  


Evaluating the outputs of Large Language Models (LLMs) is often a challenging task, requiring nuanced criteria that are difficult to quantify and even harder to automate. For instance, how do we reliably assess if a model’s response is:  
- factually accurate?  
- concise yet comprehensive?  
- free of hallucinations?  
- aligned with ethical and domain-specific guidelines?  

These questions demand a human-like understanding that traditional metrics like BLEU or ROUGE often fail to capture. Crafting rule-based systems for such evaluations is equally daunting due to the subjective and complex nature of these tasks.  

✅ Enter `judges`: an open-source [library](https://github.com/quotient-ai/judges) that simplifies and streamlines LLM evaluations with pre-built and customizable evaluators, inspired by research-backed LLM-as-a-judge prompts. It’s designed for a wide range of use cases, from factual correctness to hallucination detection, and offers a low-friction interface for both quick setups and advanced customizations.  

💡 The core idea is simple yet transformative: use LLMs themselves to evaluate other LLMs—efficiently, scalably, and with human-like reasoning.

🤖✓ The prompts behind these LLM judges are backed by state-of-the-art research, including influential works such as _"Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena"_ ([Zheng et al., 2023](https://arxiv.org/abs/2306.05685)) and _"Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models"_ ([Verga et al., 2024](https://arxiv.org/abs/2404.18796)).  

In this cookbook, we’ll use a subset of Google’s _Natural Questions_ dataset to demonstrate the use of `judges`. The task involves:  
1. Starting with a pre-annotated dataset where responses are labeled as "good" or "bad" based on Wikipedia content.  
2. Using AI search engines like Perplexity, Exa, and Gemini to generate responses to the questions from the "good" examples.  
3. Applying `judges` to evaluate these responses for aspects like correctness and hallucination.   

Through this process, you’ll see how `judges` can simplify evaluation workflows while maintaining rigor and scalability. Let’s dive in!

In [None]:
# !pip install judges datasets google-generativeai exa_py

In [4]:
import pandas as pd
from dotenv import load_dotenv
import os
from IPython.display import Markdown, HTML
from tqdm import tqdm

load_dotenv()

True

In [5]:
HF_API_KEY = os.getenv('HF_API_KEY')

if HF_API_KEY:
    !huggingface-cli login --token $HF_API_KEY
else:
    print("Hugging Face API key not found.")

The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: read).
Your token has been saved to /Users/jamesliounis/.cache/huggingface/token
Login successful


In [6]:
from datasets import load_dataset

dataset = load_dataset("quotientai/labeled-natural-qa-random-100")

data = dataset['train'].to_pandas()

data = data[data['label'] == 'good']

data.head()


| Unnamed: 0 | url | input_text | completion | label | feedback |
|---|---|---|---|---|---|
| 1 | https://en.wikipedia.org//w/index.php?title=Un... | what is the title of the person who runs the h... | ['Speaker of the House'] | good | |
| 2 | https://en.wikipedia.org//w/index.php?title=Th... | yo la tengo theres a riot going on release date | ['March 16, 2018'] | good | |
| 3 | https://en.wikipedia.org//w/index.php?title=Th... | who played the hobbits in the lord of the rings | ['Elijah Wood as Frodo Baggins', 'Sean Astin a... | good | |
| 4 | https://en.wikipedia.org//w/index.php?title=Th... | where does the show the path take place | ['Upstate New York'] | good | |
| 5 | https://en.wikipedia.org//w/index.php?title=Mi... | when did michigan last win a national champion... | ['1989'] | good | |


## Generating answers to our queries using AI Search engines

In this part, we use search-based AI engines to generate answers to the questions whose reference answers we have already categorized as "good". 

You can set the API keys either in a `.env` file, as we do below, or in Google Colab secrets, in which case use the commented-out commands.
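One way to support both options with a single helper is sketched below. The name `get_api_key` is hypothetical (it is not part of this notebook). It reads the environment first, which already contains the `.env` values once `load_dotenv()` has run, and falls back to Colab's `userdata` only when that module can be imported.

```python
import os


def get_api_key(name):
    """Return the key from the environment, else from Colab secrets, else None."""
    value = os.getenv(name)
    if value:
        return value
    try:
        # Only importable inside Google Colab
        from google.colab import userdata
    except ImportError:
        return None
    try:
        return userdata.get(name)
    except Exception:
        # Secret missing, or the notebook was not granted access to it
        return None
```

With this in place, `GOOGLE_API_KEY = get_api_key('GOOGLE_API_KEY')` would behave the same way locally and in Colab.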

In [None]:
PERPLEXITY_API_KEY = os.getenv('PERPLEXITY_API_KEY')

### Gemini

We use the Gemini API with the grounding option, following [official Google documentation](https://ai.google.dev/gemini-api/docs/grounding?lang=python).

In [None]:
GOOGLE_API_KEY = os.getenv('GOOGLE_API_KEY')

## Use this if using Colab
#GOOGLE_API_KEY=userdata.get('GOOGLE_API_KEY')

In [10]:
# from google.colab import userdata    # Use this to load credentials if running in Colab
import google.generativeai as genai
from IPython.display import Markdown, HTML

# GOOGLE_API_KEY=userdata.get('GOOGLE_API_KEY')
genai.configure(api_key=GOOGLE_API_KEY)

We first test out the Gemini client to see if everything works as planned!

In [11]:
model = genai.GenerativeModel('models/gemini-1.5-pro-002')
response = model.generate_content(contents="What is the land area of Spain?",
                                  tools='google_search_retrieval')

In [25]:
Markdown(response.candidates[0].content.parts[0].text)

Spain's land area covers approximately 500,000 square kilometers.  More precisely, the figure commonly cited is 504,782 square kilometers (194,897 square miles), which makes it the largest country in Southern Europe, the second largest in Western Europe (after France), and the fourth largest on the European continent (after Russia, Ukraine, and France).

Including its island territories—the Balearic Islands in the Mediterranean and the Canary Islands in the Atlantic—the total area increases slightly to around 505,370 square kilometers.  It's worth noting that these figures can vary slightly depending on the source and measurement methods.  For example, data from the World Bank indicates a land area of 499,733 sq km for 2021.  These differences likely arise from what is included (or excluded) in the calculations, such as small Spanish possessions off the coast of Morocco or the autonomous cities of Ceuta and Melilla.


In [12]:
model = genai.GenerativeModel('models/gemini-1.5-pro-002')


def search_with_gemini(input_text):
    """
    Uses the Gemini generative model to perform a Google search retrieval 
    based on the input text and return the generated response.

    Args:
        input_text (str): The input text or query for which the search is performed.

    Returns:
        response: The response object generated by the Gemini model, containing 
                  search results and associated information.
    """
    response = model.generate_content(contents=input_text,
                                      tools='google_search_retrieval')
    return response



# Function to parse the text of the first candidate from the response object
def parse_gemini_output(response):
    return response.candidates[0].content.parts[0].text
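The parser above assumes every response has at least one candidate with at least one text part. If a response ever comes back without them, the attribute chain raises and the whole `apply` stops. A more defensive variant might look like the sketch below; `parse_gemini_output_safe` is a hypothetical name, and the `SimpleNamespace` objects are stand-ins that only mimic the attribute layout, not real API responses.

```python
from types import SimpleNamespace


def parse_gemini_output_safe(response):
    """Return the first candidate's text, or None if the response has no text part."""
    try:
        return response.candidates[0].content.parts[0].text
    except (AttributeError, IndexError, TypeError):
        return None


# Stand-in objects that mimic the attribute layout used above (not real API responses)
ok = SimpleNamespace(candidates=[SimpleNamespace(
    content=SimpleNamespace(parts=[SimpleNamespace(text="Madrid")]))])
empty = SimpleNamespace(candidates=[])

print(parse_gemini_output_safe(ok))     # Madrid
print(parse_gemini_output_safe(empty))  # None
```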


Once we have our functions ready, we run inference on our data. 

In [None]:
tqdm.pandas()

data['gemini_response'] = data['input_text'].progress_apply(search_with_gemini)

100%|██████████| 67/67 [05:04<00:00,  4.54s/it]
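The run above makes 67 sequential API calls, so a single transient failure (a timeout or a rate limit, for example) would abort the whole `progress_apply`. A minimal sketch of a retry wrapper with exponential backoff follows. `with_retries` and `flaky` are hypothetical names, and catching every `Exception` is a simplifying assumption; in practice you would likely narrow it to the errors you expect.

```python
import time


def with_retries(fn, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Wrap fn so that exceptions trigger a retry with exponential backoff."""
    def wrapped(*args, **kwargs):
        for attempt in range(max_attempts):
            try:
                return fn(*args, **kwargs)
            except Exception:
                if attempt == max_attempts - 1:
                    raise  # out of attempts: surface the last error
                sleep(base_delay * 2 ** attempt)
    return wrapped


# Demo with a stand-in function that fails twice before succeeding
calls = {"n": 0}


def flaky(query):
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("temporary failure")
    return f"answer to {query}"


# Skip real sleeping in the demo by passing a no-op sleep function
safe_flaky = with_retries(flaky, sleep=lambda seconds: None)
print(safe_flaky("What is the land area of Spain?"))
```

In this notebook, the wrapper could be applied as `data['input_text'].progress_apply(with_retries(search_with_gemini))`.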


In [None]:
# Parse the text output from the response object
data['gemini_response_parsed'] = data['gemini_response'].apply(parse_gemini_output)

### Perplexity

We get started with the API by following [this documentation](https://www.perplexity.ai/hub/blog/introducing-pplx-api).

In [None]:
# PERPLEXITY_API_KEY=userdata.get('PERPLEXITY_API_KEY')

In [71]:
import requests


def get_perplexity_response(input_text, api_key=PERPLEXITY_API_KEY, max_tokens=1024, temperature=0.2, top_p=0.9):
    """
    Sends an input text to the Perplexity API and retrieves a response.

    Args:
        input_text (str): The user query to send to the API.
        api_key (str): The Perplexity API key for authorization.
        max_tokens (int): Maximum number of tokens for the response.
        temperature (float): Sampling temperature for randomness in responses.
        top_p (float): Nucleus sampling parameter.

    Returns:
        dict: The JSON response from the API if successful.
        str: Error message if the request fails.
    """
    url = "https://api.perplexity.ai/chat/completions"

    # Define the payload
    payload = {
        "model": "llama-3.1-sonar-small-128k-online",
        "messages": [
            {
                "role": "system",
                "content": "You are a helpful assistant. Be precise and concise."
            },
            {
                "role": "user",
                "content": input_text
            }
        ],
        "max_tokens": max_tokens,
        "temperature": temperature,
        "top_p": top_p,
        "search_domain_filter": ["perplexity.ai"],
        "return_images": False,
        "return_related_questions": False,
        "search_recency_filter": "month",
        "top_k": 0,
        "stream": False,
        "presence_penalty": 0,
        "frequency_penalty": 1
    }

    # Define the headers
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json"
    }

    # Make the API request
    response = requests.post(url, json=payload, headers=headers)

    # Check and return the response
    if response.status_code == 200:
        return response.json()  # Return the JSON response
    else:
        return f"Error: {response.status_code}, {response.text}"


In [None]:
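# Function to parse the text output from the response object
def parse_perplexity_output(response):
    # Failed requests come back as an error string rather than a dict, so map those to None
    return response['choices'][0]['message']['content'] if isinstance(response, dict) else None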

In [73]:
tqdm.pandas()

data['perplexity_response'] = data['input_text'].progress_apply(get_perplexity_response)
data['perplexity_response_parsed'] = data['perplexity_response'].apply(parse_perplexity_output)

100%|██████████| 67/67 [02:12<00:00,  1.98s/it]


### Exa AI

Unlike Perplexity and Gemini, Exa doesn't offer an integrated RAG API built on search results. Instead, it provides a wrapper around the OpenAI client, described in [this documentation](https://docs.exa.ai/reference/openai). 

In [13]:
# !pip install exa_py

from openai import OpenAI
from exa_py import Exa


In [14]:
# # Use this if using Colab
# EXA_API_KEY=userdata.get('EXA_API_KEY')
# OPENAI_API_KEY=userdata.get('OPENAI_API_KEY')

EXA_API_KEY = os.getenv('EXA_API_KEY')
OPENAI_API_KEY = os.getenv('OPENAI_API_KEY')

In [15]:
import numpy as np

from openai import OpenAI
from exa_py import Exa

openai = OpenAI(api_key=OPENAI_API_KEY)
exa = Exa(EXA_API_KEY)

# Wrap OpenAI with Exa
exa_openai = exa.wrap(openai)

def get_exa_openai_response(model="gpt-4o-mini", input_text=None):
    """
    Generate a response from an OpenAI model via the Exa wrapper. Returns NaN if an error occurs.

    Uses the module-level `exa_openai` client created above.

    Args:
        model (str): The OpenAI model to use (e.g., "gpt-4o-mini").
        input_text (str): The input text to send to the model.

    Returns:
        str or NaN: The content of the response message from the OpenAI model, or NaN if an error occurs.
    """
    try:
        # Generate a completion (disable tools)
        completion = exa_openai.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": input_text}],
            tools=None  # Ensure tools are not used
        )

        # Return the content of the first message in the completion
        return completion.choices[0].message.content

    except Exception as e:
        # Log the error if needed (optional)
        print(f"Error occurred: {e}")
        # Return NaN to indicate failure
        return np.nan


# Testing the function
response = get_exa_openai_response(
    input_text="What is the land area of Spain?"
)

print(response)


Wrapping OpenAI client with Exa functionality. <class 'function'>
The total land area of Spain is approximately 505,370 square kilometers (195,124 square miles).


In [16]:
tqdm.pandas()

data['exa_openai_response_parsed'] = data['input_text'].progress_apply(lambda x: get_exa_openai_response(input_text=x))

 33%|███▎      | 22/67 [01:15<02:50,  3.78s/it]

Error occurred: Error code: 400 - {'error': {'message': "An assistant message with 'tool_calls' must be followed by tool messages responding to each 'tool_call_id'. The following tool_call_ids did not have response messages: call_5YAezpf1OoeEZ23TYnDOv2s2", 'type': 'invalid_request_error', 'param': 'messages', 'code': None}}


100%|██████████| 67/67 [04:05<00:00,  3.66s/it]
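Because `get_exa_openai_response` returns NaN on failure, an error like the one logged above leaves a gap in the column instead of stopping the run. Before judging the responses, it helps to know which rows are affected. A small standard-library sketch is shown below; `failed_indices` is a hypothetical helper and the sample values are stand-ins.

```python
import math


def failed_indices(responses):
    """Return positions whose response is missing (None or NaN)."""
    return [
        i for i, r in enumerate(responses)
        if r is None or (isinstance(r, float) and math.isnan(r))
    ]


# Stand-in values: two answers and one failed call recorded as NaN
responses = ["Upstate New York", float("nan"), "1989"]
print(failed_indices(responses))  # [1]
```

On the DataFrame itself, `data['exa_openai_response_parsed'].isna()` gives the equivalent boolean mask.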


In [None]:
#data.drop(columns=['feedback', 'label'], inplace=True)

In [None]:
## Saving the data to a CSV file
#data.to_csv('../data/quotient_natural_qa_with_AI_search_completions.csv', index=False)

In [89]:
df = pd.read_csv('../data/quotient_natural_qa_with_AI_search_completions.csv')
df.head()

| Unnamed: 0 | url | input_text | completion | gemini_response | gemini_response_parsed | perplexity_response | perplexity_response_parsed | exa_openai_response_parsed |
|---|---|---|---|---|---|---|---|---|
| 0 | https://en.wikipedia.org//w/index.php?title=Un... | what is the title of the person who runs the h... | ['Speaker of the House'] | response:\nGenerateContentResponse(\n done=... | The title of the person who runs the House of ... | {'id': '4acced38-a9bb-4f24-8e86-caadf25cab00',... | The title of the person who runs the House of ... | The title of the person who runs the House of ... |
| 1 | https://en.wikipedia.org//w/index.php?title=Th... | yo la tengo theres a riot going on release date | ['March 16, 2018'] | response:\nGenerateContentResponse(\n done=... | There are several albums and games with "Riot"... | {'id': '2fe35806-4b8d-45a2-9bd1-3b73f3f86faf',... | The release date for Yo La Tengo's album "Ther... | Yo La Tengo's album **"There's a Riot Going On... |
| 2 | https://en.wikipedia.org//w/index.php?title=Th... | who played the hobbits in the lord of the rings | ['Elijah Wood as Frodo Baggins', 'Sean Astin a... | response:\nGenerateContentResponse(\n done=... | The four main hobbits in *The Lord of the Ring... | {'id': '943301b5-b371-431f-ae8a-ec9a6de4812b',... | In "The Lord of the Rings: The Fellowship of t... | |
| 3 | https://en.wikipedia.org//w/index.php?title=Th... | where does the show the path take place | ['Upstate New York'] | response:\nGenerateContentResponse(\n done=... | The show *The Path* primarily takes place in t... | {'id': '214c8f5e-1669-4bc4-ac5a-c9c4ac178cf2',... | The show "The Path" takes place in a rural set... | |
| 4 | https://en.wikipedia.org//w/index.php?title=Mi... | when did michigan last win a national champion... | ['1989'] | response:\nGenerateContentResponse(\n done=... | Michigan last won a national championship in t... | {'id': '5774a4f3-a1a4-4e0b-8a96-7ad9d451e7c8',... | The 1948 Michigan Wolverines football team las... | The Michigan Wolverines last won a national ch... |
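One thing to keep in mind after reloading: CSV stores every cell as text, so the `perplexity_response` column now holds the string form of each dict rather than the dict itself. If the raw payload is needed again, `ast.literal_eval` can recover it, as in the sketch below. `restore_dict` is a hypothetical helper and the sample cell is illustrative. The `gemini_response` column holds the printed form of an object, not a Python literal, so it cannot be recovered this way.

```python
import ast


def restore_dict(cell):
    """Parse a stringified Python dict back into a dict; return None if it cannot be parsed."""
    if isinstance(cell, dict):
        return cell
    try:
        value = ast.literal_eval(cell)
    except (ValueError, SyntaxError, TypeError):
        return None
    return value if isinstance(value, dict) else None


# Stand-in cell shaped like the Perplexity payload parsed earlier (content is illustrative)
cell = "{'choices': [{'message': {'content': 'Speaker of the House'}}]}"
restored = restore_dict(cell)
print(restored['choices'][0]['message']['content'])  # Speaker of the House

# The printed form of a Gemini response is not a Python literal
print(restore_dict("response:\nGenerateContentResponse("))  # None
```

Applied to the reloaded frame, this would be `df['perplexity_response'].apply(restore_dict)`.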
