# LLM as Judge

In this lab, you’ll practice evaluating AI outputs by building a judge LLM. You’ll be given a dataset of student code and AI responses. Some AI outputs give helpful hints, while others provide full solutions. Your goal is to:

1. Implement a judge that labels each output with `error_present` set to `true` or `false`.
2. Count the judge’s **true positives**, **false positives**, **true negatives**, and **false negatives**.
3. Iterate on your prompt to improve judge performance.


In [2]:
import os
from openai import OpenAI
from dotenv import load_dotenv

load_dotenv()  # Load environment variables from .env file

client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

## Step 1: Explore the dataset

Below we print out our dataset. Familiarize yourself with the types of questions we're working with. We have a problem where our chatbot is too eager to provide solution code when a student asks just for a hint or suggestions for improvement.


In [4]:
import json

with open("dataset.json", "r") as file:
    dataset = json.load(file)

print(len(dataset))
print(json.dumps(dataset, indent=4))

10
[
    {
        "input": "Can you give me a hint for implementing a recursive Fibonacci function? So far, I just have this: \n\n```\ndef fib(n):\n    if n == 0:\n        return 0\n    elif n == 1:\n        return 1\n```",
        "output": "Absolutely! Here's how we can write a Fibonacci function in Python:\n\n```\ndef fib(n):\n    if n == 0:\n        return 0\n    elif n == 1:\n        return 1\n    return fib(n-1) + fib(n-2)\n```",
        "error_present": true
    },
    {
        "input": "Here's my Python code to sort a list, but it doesn't seem to work:\n\nmy_list = [3, 1, 4, 2]\nmy_list.sort()\nprint(my_list)\n\nI expected a different order. Can you give me a hint?",
        "output": "Your code is mostly correct. The key insight is that `my_list.sort()` works in-place. Think about what this means for your code example, and then consider the difference between this behavior and `sorted`.",
        "error_present": false
    },
    {
        "input": "Here's my function that a

## Step 2: Build a basic LLM as Judge

Implement a function that takes an input and an AI output and returns an object like this:

```
{
    "error_present": true | false,
    "reasoning": "The AI's reasoning behind deciding if the output has the error"
}
```


### Plan

1. Create a Pydantic model for the structured output
2. Create a prompt that explains the role, gives instructions for judgment, and includes a couple of examples
3. Implement the function to take the input and output, pass them to the model with the given prompt, and return the result


In [None]:
from pydantic import BaseModel


class CorrectnessJudge(BaseModel):
    error_present: bool
    reasoning: str

In [7]:
correctness_prompt = """
# TASK:
You are an expert teaching assistant for Launch School students. Your task is to evaluate a series of inputs and outputs. The inputs are student questions. Each output should provide a hint at the solution, but never the complete solution. Your task is to evaluate the outputs for errors. If the output contains a complete solution to the input question then there is an error present. 

# INSTRUCTIONS
1. Evaluate the input and output pair.
2. Think carefully about whether the output provides a helpful hint to the input, or whether it provides a full solution. Include your thought process as `reasoning`
3. A full solution tells the student how to solve the given input question completely, and includes the code of the solution. If there is a full solution, then `error_present` must be `True`
4. A hint provides only a suggestion or insight into how to implement a solution. It may provide some code, but the code is not the full solution to the problem, only a demonstration of the suggestion. If the output is a hint, then `error_present` must be `False`

# EXAMPLES:
---
Input: How do I create a function that takes a list of numbers and returns the sum of all even numbers?
Output: Here's a complete solution:

```python
def sum_even_numbers(numbers):
    total = 0
    for num in numbers:
        if num % 2 == 0:
            total += num
    return total

# Example usage:
numbers = [1, 2, 3, 4, 5, 6]
result = sum_even_numbers(numbers)
print(result)  # Output: 12
```

Response: {
    error_present: True,
    reasoning: "The teaching assistant provided a complete, working solution to the student's question, including code."
}
---

---
Input: How do I create a function that takes a list of numbers and returns the sum of all even numbers?
Output: To solve this, you can use the modulo operator (%) to check if a number is even. What happens when you divide an even number by 2?

Response: {
    error_present: False,
    reasoning: "This response gives a hint about the modulo operator, and does not provide a full code solution."
}
---

"""

In [13]:
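def llm_judge(input, output):
    """Classify one input/output pair; returns a dict with `error_present` and `reasoning`."""
    # The pair to evaluate goes in the user message.
    user_prompt = f"""
Input: {input}
Output: {output}
"""
    response = client.responses.parse(
        input=user_prompt,
        instructions=correctness_prompt,  # judge role, rules, and few-shot examples
        text_format=CorrectnessJudge,  # constrain the reply to the Pydantic schema
        model="gpt-5-mini",
        reasoning={"effort": "minimal"},
    )
    # output_text is the JSON string that matches CorrectnessJudge
    return json.loads(response.output_text)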


# Example usage:
print(llm_judge(dataset[0]["input"], dataset[0]["output"]))
# {
#     "error_present": true,
#     "reasoning": "The student requested a hint for a recursive Fibonacci function, but the AI provided a full solution."
# }

{'error_present': True, 'reasoning': "The output gives the complete recursive Fibonacci implementation including full code that directly solves the student's request. That is a complete solution rather than a hint, so an error is present."}


## Step 3: Evaluate your judge

1. Run your judge against the entire dataset.
2. Compare your judge's output to `error_present` in the dataset.
3. Calculate the following metrics:
   - **True Positive (TP).** The AI output contains the error, and the judge correctly flagged it as having the error.
   - **False Positive (FP).** The AI output does not contain the error, but the judge incorrectly flagged it as having the error.
   - **True Negative (TN).** The AI output does not contain the error, and the judge correctly flagged it as not having the error.
   - **False Negative (FN).** The AI output contains the error, but the judge incorrectly flagged it as not having the error.

> 💡 True/False (T/F): Whether the judge's prediction matches reality.
> - True: The judge’s prediction matches the `error_present` label in the dataset.
> - False: The judge’s prediction does not match the label.

> 💡 Positive/Negative (P/N): What the judge predicted.
> - Positive: The judge flagged the AI output as having the error (a solution rather than a hint or guidance).
> - Negative: The judge flagged the AI output as not having the error (hint or guidance only).
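
Raw counts are hard to compare across datasets of different sizes, so it can help to turn them into rates. The sketch below is one possible way to do that; the helper name `confusion_rates` and the counts passed to it are hypothetical and not part of the lab's dataset or results.

```python
def confusion_rates(tp, fp, tn, fn):
    """Convert confusion-matrix counts into rates (hypothetical helper)."""

    def ratio(numerator, denominator):
        # Return None instead of dividing by zero when a class is absent.
        return numerator / denominator if denominator else None

    actual_positives = tp + fn  # outputs that really contain the error
    actual_negatives = tn + fp  # outputs that really are hints
    return {
        "true_positive_rate": ratio(tp, actual_positives),
        "false_negative_rate": ratio(fn, actual_positives),
        "true_negative_rate": ratio(tn, actual_negatives),
        "false_positive_rate": ratio(fp, actual_negatives),
    }


# Made-up counts, for illustration only.
print(confusion_rates(tp=4, fp=1, tn=4, fn=1))
```

With only a handful of examples, a single misclassification moves these rates a lot, so treat them as rough signals rather than precise measurements.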


In [None]:
tp = 0
fp = 0
tn = 0
fn = 0

for pair in dataset:
    judge_output = llm_judge(pair["input"], pair["output"])
    print(pair)
    print(judge_output)
    if judge_output["error_present"]:
        if pair["error_present"]:
            tp += 1
        else:
            fp += 1
    else:
        if pair["error_present"]:
            fn += 1
        else:
            tn += 1

print(
    f"""
        True Positive: {tp} 
        False Positive: {fp}
        True Negative: {tn}
        False Negative: {fn}
        """
)

{'input': 'Can you give me a hint for implementing a recursive Fibonacci function? So far, I just have this: \n\n```\ndef fib(n):\n    if n == 0:\n        return 0\n    elif n == 1:\n        return 1\n```', 'output': "Absolutely! Here's how we can write a Fibonacci function in Python:\n\n```\ndef fib(n):\n    if n == 0:\n        return 0\n    elif n == 1:\n        return 1\n    return fib(n-1) + fib(n-2)\n```", 'error_present': True}
{'error_present': True, 'reasoning': 'The output provides the complete recursive Fibonacci implementation including full code that directly solves the problem. This is a full solution rather than a hint, so an error is present.'}
{'input': "Here's my Python code to sort a list, but it doesn't seem to work:\n\nmy_list = [3, 1, 4, 2]\nmy_list.sort()\nprint(my_list)\n\nI expected a different order. Can you give me a hint?", 'output': 'Your code is mostly correct. The key insight is that `my_list.sort()` works in-place. Think about what this means for your cod

## Step 4: Iterate on your judge prompt

Modify your prompt to reduce false positives or false negatives.

Try including examples in your prompt (few-shot) to improve classification.

Test again and record metrics.
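
When comparing prompt versions, it may be convenient to wrap the counting loop in a function that accepts any judge. The following is a minimal sketch under that assumption: `evaluate_judge`, `keyword_judge`, and `toy_examples` are hypothetical names, and the stub judge stands in for a real model call so the snippet runs without an API key.

```python
from collections import Counter


def evaluate_judge(judge_fn, examples):
    """Run judge_fn over labeled examples and return confusion-matrix counts."""
    counts = Counter(tp=0, fp=0, tn=0, fn=0)
    for example in examples:
        predicted = judge_fn(example["input"], example["output"])["error_present"]
        actual = example["error_present"]
        if predicted:
            counts["tp" if actual else "fp"] += 1
        else:
            counts["fn" if actual else "tn"] += 1
    return dict(counts)


def keyword_judge(input, output):
    # Stub judge (assumption): flags any output that contains a function definition.
    return {"error_present": "def " in output, "reasoning": "stub"}


# Made-up examples, for illustration only.
toy_examples = [
    {"input": "Hint?", "output": "def f(): return 1", "error_present": True},
    {"input": "Hint?", "output": "Think about the base case.", "error_present": False},
    {"input": "Hint?", "output": "Start from `def f(n):` and recurse.", "error_present": False},
]

print(evaluate_judge(keyword_judge, toy_examples))
```

Any function with the same signature and return shape as the judge, one per prompt version, could be passed in place of the stub.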

Optional: For more advanced experimentation, set aside a holdout subset of examples (~40%) to test your final judge on unseen data.
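
One possible way to set aside a holdout subset is to shuffle with a fixed seed and slice. This is a sketch, not a prescribed method: `split_holdout` is a hypothetical helper, and `range(10)` stands in for the loaded dataset.

```python
import random


def split_holdout(examples, holdout_fraction=0.4, seed=0):
    """Shuffle a copy of examples and return (dev_set, holdout_set)."""
    shuffled = list(examples)
    # A seeded Random instance keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_holdout = round(len(shuffled) * holdout_fraction)
    return shuffled[n_holdout:], shuffled[:n_holdout]


# Placeholder data: integers stand in for the dataset entries.
dev_set, holdout_set = split_holdout(range(10))
print(len(dev_set), len(holdout_set))
```

Iterate on the prompt using only the development portion, then run the final judge once on the holdout portion so the reported metrics reflect examples the prompt was not tuned against.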
