<a href="https://colab.research.google.com/github/yellowleather/LLM-Fundamentals/blob/main/agenteval_llm_judge.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 🧪 LLM Judge Homework: Step-by-Step Evaluation
In this assignment, you'll incrementally build an LLM-based judge to compare two summaries.

You'll go through the following steps:
1. Base judgment: which is better overall?
2. Add rubric: accuracy, coverage, clarity
3. Add explanations per rubric
4. Add chain-of-thought reasoning

In [None]:
!pip install openai



In [None]:
from getpass import getpass
from openai import OpenAI

# Read the key interactively so it is not stored in the notebook.
api_key = getpass('Enter your OpenAI API key: ')
client = OpenAI(api_key=api_key)

Enter your OpenAI API key: ··········


## 🧾 Example Set

In [None]:
examples = [
    {
        "id": "ex1",
        "context": "The UN released a report warning of global temperature rise and called for urgent international action.",
        "summary_a": "The UN warned that climate change is worsening and action is needed.",
        "summary_b": "The UN praised global progress in reducing emissions."
    },
    {
        "id": "ex2",
        "context": "NASA launched Artemis I, an uncrewed spacecraft that will orbit the Moon and return to Earth, preparing for human missions.",
        "summary_a": "NASA launched Artemis I to prepare for future Moon missions.",
        "summary_b": "NASA's Artemis I failed to launch due to engine problems."
    },
    {
        "id": "ex3",
        "context": "A study found intermittent fasting improves blood sugar and cholesterol levels.",
        "summary_a": "Intermittent fasting improves health markers like blood sugar and cholesterol.",
        "summary_b": "Fasting was linked to poor nutrition and increased blood pressure."
    }
]

## 🔹 Step 1: Base Judgment – A or B?
No rubric yet: the judge simply picks the better summary and explains why.

In [None]:
def judge_base(context, summary_a, summary_b, client):
    """Ask the model which summary is better; returns the raw reply text."""
    prompt = f'''
You're evaluating two summaries of an article.

Article:
\"\"\"{context}\"\"\"

Summary A:
\"\"\"{summary_a}\"\"\"

Summary B:
\"\"\"{summary_b}\"\"\"

Which one is better and why? Reply in JSON:
{{
  "final_answer": "A" or "B",
  "explanation": "..."
}}
'''
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content


In [None]:
import json

def run_judge_base(client):
    print("Running Step 1: Base Judgment")
    for ex in examples:
        result = judge_base(ex['context'], ex['summary_a'], ex['summary_b'], client)
        print(result)

run_judge_base(client)

Running Step 1: Base Judgment
```json
{
  "final_answer": "A",
}
```
```json
{
  "final_answer": "A",
  "explanation": "Summary A accurately captures the main point of the article, which is that NASA successfully launched Artemis I to prepare for future Moon missions. Summary B is incorrect as it states that Artemis I failed to launch due to engine problems, which contradicts the information provided in the article."
}
```
```json
{
  "final_answer": "A",
  "explanation": "Summary A accurately reflects the findings of the article by stating that intermittent fasting improves health markers such as blood sugar and cholesterol. Summary B, on the other hand, introduces information that contradicts the article by suggesting fasting is linked to poor nutrition and increased blood pressure, which is not supported by the original text."
}
```
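The replies printed above are plain text, and the model wraps them in Markdown code fences, so they cannot be passed to `json.loads` directly. Below is a minimal sketch of one way to recover a dict, assuming the reply holds at most one fenced block. `parse_judge_reply` is a hypothetical helper, not part of the original assignment.

```python
import json
import re

# Hypothetical helper (not part of the assignment): strips an optional
# Markdown code fence from a model reply and parses the JSON inside.
FENCE = "`" * 3  # built programmatically to avoid a literal fence in this cell
FENCE_RE = re.compile(
    r"^" + FENCE + r"(?:json)?\s*\n(.*?)\n?" + FENCE + r"\s*$", re.DOTALL
)

def parse_judge_reply(reply):
    text = reply.strip()
    match = FENCE_RE.match(text)
    if match:
        text = match.group(1)  # keep only the fenced body
    return json.loads(text)  # raises json.JSONDecodeError on malformed JSON

fenced = FENCE + 'json\n{"final_answer": "A", "explanation": "..."}\n' + FENCE
bare = '{"final_answer": "B", "explanation": "..."}'
print(parse_judge_reply(fenced)["final_answer"])
print(parse_judge_reply(bare)["final_answer"])
```

A reply that is not valid JSON still raises `json.JSONDecodeError`, so a caller may want to catch it and retry or log the raw text.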


## 🔹 Step 2: Add Rubric Dimensions – Accuracy, Coverage, Clarity

In [None]:
def judge_with_rubric(context, summary_a, summary_b, client):
    prompt = f'''
Evaluate two summaries using this rubric (0–10 per dimension):
- Accuracy
- Coverage
- Clarity

Article:
\"\"\"{context}\"\"\"

Summary A:
\"\"\"{summary_a}\"\"\"

Summary B:
\"\"\"{summary_b}\"\"\"

Respond in JSON:
{{
  "summary_a": {{"accuracy": int, "coverage": int, "clarity": int}},
  "summary_b": {{"accuracy": int, "coverage": int, "clarity": int}},
  "final_answer": "A" or "B"
}}
'''
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content


In [None]:
import json

def run_judge_rubric(client):
    print("Running Step 2: Rubric Judgment")
    for ex in examples:
        result = judge_with_rubric(ex['context'], ex['summary_a'], ex['summary_b'], client)
        print(result)

run_judge_rubric(client)

Running Step 2: Rubric Judgment
```json
{
  "summary_a": {"accuracy": 9, "coverage": 8, "clarity": 9},
  "summary_b": {"accuracy": 2, "coverage": 2, "clarity": 8},
  "final_answer": "A"
}
```
```json
{
  "summary_a": {"accuracy": 10, "coverage": 6, "clarity": 9},
  "summary_b": {"accuracy": 0, "coverage": 2, "clarity": 8},
  "final_answer": "A"
}
```
```json
{
  "summary_a": {"accuracy": 10, "coverage": 10, "clarity": 10},
  "summary_b": {"accuracy": 0, "coverage": 0, "clarity": 5},
  "final_answer": "A"
}
```
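With per-dimension scores available, the `final_answer` can be cross-checked against the totals. The sketch below assumes an unweighted sum of the three dimensions, which the assignment does not specify. `rubric_winner` is a hypothetical helper, and the sample dict copies the second reply printed above.

```python
DIMENSIONS = ("accuracy", "coverage", "clarity")

def rubric_winner(judgment):
    """Return (winner, total_a, total_b) from an unweighted sum of scores."""
    total_a = sum(judgment["summary_a"][d] for d in DIMENSIONS)
    total_b = sum(judgment["summary_b"][d] for d in DIMENSIONS)
    if total_a == total_b:
        winner = "tie"
    else:
        winner = "A" if total_a > total_b else "B"
    return winner, total_a, total_b

# Scores copied from the second reply above (the Artemis I example).
judgment = {
    "summary_a": {"accuracy": 10, "coverage": 6, "clarity": 9},
    "summary_b": {"accuracy": 0, "coverage": 2, "clarity": 8},
    "final_answer": "A",
}
winner, total_a, total_b = rubric_winner(judgment)
print(winner, total_a, total_b)  # A 25 10
print(winner == judgment["final_answer"])  # True
```

A mismatch between the computed winner and `final_answer` would suggest the judge is weighing dimensions unevenly or answering inconsistently.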


## 🔹 Step 3: Add Explanations per Rubric Dimension

In [None]:
def judge_with_rubric_expl(context, summary_a, summary_b, client):
    one_shot = '''
Example:

Article:
"""The Mars rover Perseverance successfully collected its first rock samples, which will help scientists study signs of ancient life on Mars."""

Summary A:
"""Perseverance gathered its first rock samples, advancing the mission to study past Martian life."""

Summary B:
"""NASA's rover was unable to collect samples due to a drilling malfunction."""

Evaluation:
{
  "summary_a": {
    "accuracy": 9,
    "accuracy_explanation": "Accurately reflects that the rover collected samples and the purpose of the mission.",
    "coverage": 9,
    "coverage_explanation": "Covers both the sampling and the scientific goal.",
    "clarity": 9,
    "clarity_explanation": "Clear and concise summary with no ambiguity."
  },
  "summary_b": {
    "accuracy": 2,
    "accuracy_explanation": "Factually incorrect: the rover did collect samples successfully.",
    "coverage": 3,
    "coverage_explanation": "Misses all key points from the article and introduces incorrect information.",
    "clarity": 7,
    "clarity_explanation": "Somewhat clear, but misleading due to false claims."
  },
  "final_answer": "A"
}

---
'''

    prompt = f'''
{one_shot}

Now evaluate the following:

Article:
\"\"\"{context}\"\"\"

Summary A:
\"\"\"{summary_a}\"\"\"

Summary B:
\"\"\"{summary_b}\"\"\"

Respond only with a JSON object in the same format as the example evaluation.
'''

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "system", "content": "You are a strict JSON evaluator for summarization quality using rubric dimensions."},
                  {"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content.strip()
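
The one-shot example implies that each summary should carry a score and a matching `*_explanation` for every dimension. A small check, sketched here under that assumption, can flag replies that omit a field. `missing_fields` is a hypothetical helper, and the sample reply is invented for illustration.

```python
DIMENSIONS = ("accuracy", "coverage", "clarity")

def missing_fields(judgment):
    """List the expected keys that are absent from a parsed judge reply."""
    missing = []
    for summary in ("summary_a", "summary_b"):
        scores = judgment.get(summary, {})
        for dim in DIMENSIONS:
            for key in (dim, f"{dim}_explanation"):
                if key not in scores:
                    missing.append(f"{summary}.{key}")
    if judgment.get("final_answer") not in ("A", "B"):
        missing.append("final_answer")
    return missing

# Illustrative reply (invented for this sketch) with one explanation left out.
sample = {
    "summary_a": {
        "accuracy": 8, "accuracy_explanation": "...",
        "coverage": 7, "coverage_explanation": "...",
        "clarity": 9, "clarity_explanation": "...",
    },
    "summary_b": {
        "accuracy": 2,
        "coverage": 2, "coverage_explanation": "...",
        "clarity": 8, "clarity_explanation": "...",
    },
    "final_answer": "A",
}
print(missing_fields(sample))  # ['summary_b.accuracy_explanation']
```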


In [None]:
def run_judge_rubric_expl(client):
    print("Running Step 3: Rubric + Explanation")
    for ex in examples:
        result = judge_with_rubric_expl(ex['context'], ex['summary_a'], ex['summary_b'], client)
        print(result)

run_judge_rubric_expl(client)


Running Step 3: Rubric + Explanation
{
  "summary_a": {
    "accuracy": 8,
    "coverage": 7,
    "clarity": 9,
    "clarity_explanation": "Clear and easy to understand, with a concise summary of the article's main points."
  },
  "summary_b": {
    "accuracy": 2,
    "coverage": 2,
    "coverage_explanation": "Fails to address the main content of the article, instead providing an incorrect statement.",
    "clarity": 8,
    "clarity_explanation": "While the sentence is clear in its language, it completely misrepresents the article's content."
  },
  "final_answer": "A"
}
{
  "summary_a": {
    "accuracy": 9,
    "accuracy_explanation": "Accurately reflects the successful launch of Artemis I and its purpose in preparing for future missions.",
    "coverage": 8,
    "coverage_explanation": "Mentions the launch and intent but lacks the detail about orbiting the Moon and returning to Earth.",
    "clarity": 9,
    "clarity_explanation": "Clear and concise, providing the main purpose witho

## 🔹 Step 4: Add Chain-of-Thought Reasoning

In [None]:
def judge_chain_of_thought(context, summary_a, summary_b, client):
    prompt = f"""
Evaluate two summaries using this rubric (0–10 per dimension):
- Accuracy
- Coverage
- Clarity

Article:
\"\"\"{context}\"\"\"

Summary A:
\"\"\"{summary_a}\"\"\"

Summary B:
\"\"\"{summary_b}\"\"\"

Reason step-by-step through **each dimension** before assigning scores.

Respond only in JSON:
{{
  "reasoning": "...",
  "summary_a": {{"accuracy": int, "coverage": int, "clarity": int}},
  "summary_b": {{"accuracy": int, "coverage": int, "clarity": int}},
  "final_answer": "A" or "B"
}}
"""

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "You are a rubric-based evaluator that reasons step-by-step before scoring."},
            {"role": "user", "content": prompt}
        ]
    )
    return response.choices[0].message.content.strip()
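
The replies in this notebook arrive sometimes inside Markdown fences and sometimes bare. One option is the Chat Completions `response_format={"type": "json_object"}` setting (JSON mode), which asks the model to return a single JSON object. It requires the word "JSON" to appear in the messages, which the Step 4 prompt already does. The sketch below is not part of the assignment: `judge_json_mode` is a hypothetical wrapper, and the stub client exists only so the example runs without an API key.

```python
import json
from types import SimpleNamespace

def judge_json_mode(prompt, client, model="gpt-4o"):
    """Send the prompt in JSON mode and return the parsed reply as a dict."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},  # request a bare JSON object
    )
    return json.loads(response.choices[0].message.content)

# Stub standing in for the real client so the sketch runs offline.
def _fake_create(**kwargs):
    assert kwargs["response_format"] == {"type": "json_object"}
    message = SimpleNamespace(content='{"reasoning": "...", "final_answer": "A"}')
    return SimpleNamespace(choices=[SimpleNamespace(message=message)])

stub_client = SimpleNamespace(
    chat=SimpleNamespace(completions=SimpleNamespace(create=_fake_create))
)
result = judge_json_mode("Respond only in JSON: which summary is better?", stub_client)
print(result["final_answer"])
```

With the real `client` from the setup cell, the same function would be called as `judge_json_mode(prompt, client)`. JSON mode constrains the syntax only, so the expected keys still need to be checked.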


In [None]:
def run_judge_chain_of_thought(client):
    print("Running Step 4: Chain-of-Thought Judgment")
    for ex in examples:
        result = judge_chain_of_thought(ex['context'], ex['summary_a'], ex['summary_b'], client)
        print(result)

run_judge_chain_of_thought(client)

Running Step 4: Chain-of-Thought Judgment
```json
{
  "summary_a": {"accuracy": 5, "coverage": 5, "clarity": 5},
  "summary_b": {"accuracy": 1, "coverage": 1, "clarity": 3},
  "final_answer": "A"
}
```
{
  "reasoning": "To evaluate summaries, I will consider the dimensions of accuracy, coverage, and clarity. \n\n1. **Accuracy:** This measures whether the information in the summary truly reflects the details in the source article.\n   - **Summary A**: It accurately states that NASA launched Artemis I and that it's for preparing future missions, which aligns with the article.\n   - **Summary B**: The claim that Artemis I failed to launch due to engine problems is incorrect according to the article.\n\n2. **Coverage:** This evaluates whether the summary captures significant details from the article.\n   - **Summary A**: It covers the key point that Artemis I was launched to prepare for future missions, but it doesn't mention that Artemis I is an uncrewed spacecraft that will orbit the M