# Section 3.4: Evaluate Your Prompt Templates with LLM-as-Judge

**📍 Progress:** Advanced Section (Optional) | ⭐⭐⭐⭐

| **Aspect** | **Details** |
|-------------|-------------|
| **Goal** | Add an evaluation layer that scores outputs from your prompt templates before they reach production |
| **Time** | ~40 minutes |
| **Prerequisites** | Sections 3.1–3.3 complete, `setup_utils.py` loaded |
| **Level** | **Advanced** - Recommended after mastering 3.2 & 3.3 |
| **What You'll Strengthen** | Trustworthy automation, rubric design, quality gates |
| **Next Steps** | Return to the [Module 3 overview](./README.md) or wire scores into your workflow |

---

> **💡 New to this module?** This is an **advanced optional section**. If you haven't completed Sections 3.2 and 3.3, go back and master those first. This section builds on that foundation.

You just built reusable prompt templates in Sections 3.2 and 3.3. Now you'll learn how to **evaluate those AI outputs** with an LLM-as-Judge so you can accept great responses, request revisions, or escalate risky ones.

## 🔧 Quick Setup Check

Since you completed Section 3.1, setup is already done! We just need to import it.

In [None]:
# Quick setup check - imports setup_utils
try:
    import importlib
    import setup_utils
    importlib.reload(setup_utils)
    from setup_utils import *
    print(f"✅ Setup loaded! Using {PROVIDER.upper()} with {get_default_model()}")
    print("🚀 Ready to score AI outputs with an LLM judge!")
except ImportError:
    print("❌ Setup not found!")
    print("💡 Please run 3.1-setup-and-introduction.ipynb first to set up your environment.")


## ⚖️ LLM-as-Judge Evaluation Template

### Building the Evaluation Loop for Your Prompt Templates

<div style="background:#fef3c7; border-left:4px solid #f59e0b; padding:16px; border-radius:6px; margin:20px 0; color:#000000;">
<strong style="color:#92400e;">🎯 What You'll Build in This Section</strong><br><br>

You'll create an **LLM-as-Judge rubric** that reviews the output produced by your prompt templates. The judge scores the response, explains its verdict, and tells you whether to accept it, request a revision, or fall back to a human reviewer.
<br><br>
<strong>Time Required:</strong> ~25 minutes (learn, see the example, then try it on your own outputs)
</div>

Layering a judge after your templates keeps quality high without sending everything back to humans. In Session 1 we saw that traditional metrics (F1, BLEU, ROUGE) miss hallucinations and that manual reviews are too slow to scale. A rubric-driven LLM judge gives you semantic understanding *and* consistent scoring.

#### 🎯 The Problem We're Solving

1. **🚨 Silent Failures**
   - Template-generated outputs can look polished while hiding factual or security mistakes.
   - Legacy metrics can't flag these issues because they only check surface-level overlap.

2. **⏳ Manual QA Bottlenecks**
   - Human spot checks take days and don't scale to thousands of AI responses.
   - Feedback arrives too late to keep CI/CD pipelines moving.

3. **🎯 Inconsistent Standards**
   - Without a codified rubric, every reviewer (human or AI) applies different criteria.
   - Teams struggle to know when to ship, regenerate, or escalate.

#### 🏗️ How We'll Build It: The Tactical Combination

We chain together Module 2 tactics plus what you learned about judges in Session 1.

| **Tactic** | **Purpose in This Template** | **Why Modern LLMs Need This** |
|------------|------------------------------|-------------------------------|
| **Role Prompting** | Positions the judge as a principal engineer with review authority | Anchors the evaluation in expert expectations instead of generic chat replies |
| **Structured Inputs** | Separates context, rubric, and submission using XML-style tags | Prevents the model from blending instructions with the artifact under review |
| **Rubric Decomposition** | Breaks quality into weighted criteria | Mirrors Session 1 guidance: multi-dimensional scoring avoids naive pass/fail |
| **Chain-of-Thought Justification** | Forces rationale before the decision | Produces auditable feedback and catches hallucinations sooner |
| **Decision Thresholds** | Maps weighted score to Accept / Revise / Reject actions | Gives your pipeline a clear automation hook instead of reading prose |

<div style="margin:16px 0; padding:16px; background:#eef2ff; border-left:5px solid #4338ca; border-radius:8px; color:#1f2937;">
<strong style="font-size:1.05em; color:#1e1b4b;">Reminder from Session 1</strong><br><br>
Relying on a single yes/no question (for example, 'Is this output correct?') lets hidden errors slip through. Weighted rubrics with explicit thresholds give you measurable guardrails.
</div>

### 🤔 Why Add a Judge After Prompt Templates?

- **Detect hidden regressions:** LLM judges evaluate meaning, so paraphrased but wrong answers score poorly even when lexical metrics look fine.
- **Keep automation trustworthy:** A second AI call verifies that template outputs meet the same criteria every time, reducing escalation load.
- **Accelerate iteration:** Scores highlight which tactic block to tweak, letting you A/B test prompts without waiting for human reviewers.
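
To make the first point concrete, here is a minimal sketch of how a lexical metric can reward a wrong answer. The `token_f1` helper and the two sentences are illustrative assumptions, not part of `setup_utils`; the helper is a simple unigram-overlap F1, not a full BLEU or ROUGE implementation.

```python
from collections import Counter


def token_f1(reference: str, candidate: str) -> float:
    """Unigram-overlap F1 between two strings (lowercased, whitespace tokens)."""
    ref_tokens = reference.lower().split()
    cand_tokens = candidate.lower().split()
    # Multiset intersection counts each shared token at most as often as it appears in both.
    overlap = sum((Counter(ref_tokens) & Counter(cand_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: one extra word flips the meaning.
reference = "The query is safe from SQL injection"
wrong_answer = "The query is not safe from SQL injection"

print(f"Token F1: {token_f1(reference, wrong_answer):.2f}")  # prints: Token F1: 0.93
```

The candidate contradicts the reference yet scores 0.93 on overlap. A judge that reads for meaning is what catches this kind of error.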

### 📋 LLM-as-Judge Rubric Template

```xml
<role>
You are a Principal Engineer reviewing AI-generated code feedback.
</role>

<rubric>
1. Accuracy (40%): Do the identified issues actually exist, and are they correctly described?
2. Completeness (30%): Are major concerns covered? Any critical issues missed?
3. Actionability (20%): Are recommendations specific and implementable?
4. Communication (10%): Is the review professional, clear, and well-structured?
</rubric>

<instructions>
Score each criterion 1-5 with detailed rationale:
- 5: Excellent - Exceeds expectations
- 4: Good - Meets expectations with minor gaps
- 3: Acceptable - Meets minimum bar
- 2: Needs work - Significant gaps
- 1: Unacceptable - Fails to meet standards

Calculate weighted total: (Accuracy×0.4) + (Completeness×0.3) + (Actionability×0.2) + (Communication×0.1)

Recommend:
- ACCEPT (≥3.5): Production-ready
- REVISE (2.5-3.4): Needs improvements, provide specific guidance
- REJECT (<2.5): Start over with different approach
</instructions>

<submission>
{{llm_output_under_review}}
</submission>

<output_format>
Provide structured evaluation with:
- Individual scores (1-5) with rationale for each criterion
- Weighted total score
- Recommendation (ACCEPT/REVISE/REJECT)
- Specific feedback for improvements
</output_format>
```
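
The weighted total and thresholds in the template are simple enough to recompute in code, which is useful if you want to double-check the judge's arithmetic. The sketch below assumes you have already extracted the four integer scores; the names `WEIGHTS`, `weighted_total`, and `recommend` are illustrative and not part of `setup_utils`.

```python
# Weights copied from the rubric template above.
WEIGHTS = {
    "accuracy": 0.4,
    "completeness": 0.3,
    "actionability": 0.2,
    "communication": 0.1,
}


def weighted_total(scores: dict) -> float:
    """Combine 1-5 criterion scores into the rubric's weighted total."""
    for name in WEIGHTS:
        if not 1 <= scores[name] <= 5:
            raise ValueError(f"{name} must be between 1 and 5, got {scores[name]}")
    # Round to two decimals so float noise cannot push a score across a threshold.
    return round(sum(WEIGHTS[name] * scores[name] for name in WEIGHTS), 2)


def recommend(total: float) -> str:
    """Map a weighted total to the template's ACCEPT / REVISE / REJECT bands."""
    if total >= 3.5:
        return "ACCEPT"
    if total >= 2.5:
        return "REVISE"
    return "REJECT"


example_scores = {"accuracy": 4, "completeness": 3, "actionability": 4, "communication": 5}
total = weighted_total(example_scores)
print(f"{total:.1f} -> {recommend(total)}")  # prints: 3.8 -> ACCEPT
```

Recomputing the total yourself is a cheap guard: if the judge's stated total disagrees with the value derived from its own criterion scores, treat the verdict as unreliable.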

#### 🔑 Rubric Design Principles

1. **Weighted Criteria** – Prioritise what matters most (accuracy first for safety-critical domains).
2. **Explicit Scale** – Clear definitions stop the judge from drifting between runs.
3. **Evidence-Based Rationale** – Forces the model to ground scores in the submission.
4. **Actionable Thresholds** – Numeric gates let pipelines auto-approve or request revisions.
5. **Improvement Guidance** – "Revise" outcomes must include next steps for the generator.
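
One way to keep weighted criteria honest is to store the rubric as data and render the `<rubric>` block from it, so the weights are validated before any prompt is sent. This is a sketch under that assumption; `Criterion` and `render_rubric` are hypothetical helpers, not part of `setup_utils`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    name: str
    weight_pct: int  # integer percentages avoid float rounding when summing
    question: str


RUBRIC = [
    Criterion("Accuracy", 40, "Do the identified issues actually exist, and are they correctly described?"),
    Criterion("Completeness", 30, "Are major concerns covered? Any critical issues missed?"),
    Criterion("Actionability", 20, "Are recommendations specific and implementable?"),
    Criterion("Communication", 10, "Is the review professional, clear, and well-structured?"),
]


def render_rubric(criteria: list) -> str:
    """Render criteria as a <rubric> block, refusing weights that do not sum to 100%."""
    total = sum(c.weight_pct for c in criteria)
    if total != 100:
        raise ValueError(f"Rubric weights must sum to 100, got {total}")
    lines = [
        f"{i}. {c.name} ({c.weight_pct}%): {c.question}"
        for i, c in enumerate(criteria, start=1)
    ]
    return "<rubric>\n" + "\n".join(lines) + "\n</rubric>"


print(render_rubric(RUBRIC))
```

With the rubric in one place, changing a weight during recalibration updates the prompt and the validation together.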

#### 🧪 Calibration Framework

The rubric above tells the judge **what** to score; calibration makes sure everyone scores it the **same way**. Treat calibration notes as the companion playbook that keeps your accuracy/completeness/actionability/communication scores aligned across reviewers and over time.

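Instead of generic "7/10 - pretty good" language, define what each score means. For example, **7/10 = factually accurate with minor gaps, clear structure, appropriate for the target audience, but missing one or two implementation details.** The examples in this subsection use a 10-point scale for illustration; write equivalent definitions for each level of the 1-5 scale your rubric uses.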

#### 🛠️ Use-Case Calibration Examples

Tie calibration back to your weighted criteria: the examples below show how different score levels reflect accuracy, completeness, actionability, and communication in a documentation context.

| Scenario | 9/10 | 5/10 | 2/10 |
|----------|------|------|------|
| Technical documentation | Complete, tested, and handles edge cases | Covers main flows, some gaps in error handling | Only basic concepts, missing implementation details |

#### 📏 Calibration Best Practices

- **Anchor scores:** Use real examples for every score level so the judge can compare and map them back to the rubric criteria.
- **Regular recalibration:** Review rubrics quarterly with domain experts and adjust thresholds or weights as standards evolve.
- **Inter-rater reliability:** Have multiple calibrators score the same samples to confirm they interpret the rubric the same way.
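
For the inter-rater check, a few simple agreement numbers are often enough to spot a rubric that people read differently. The sketch below compares two raters' 1-5 scores on the same samples; `agreement_stats` and the score lists are made-up illustrations, and you may prefer a formal statistic such as Cohen's kappa once you have more data.

```python
def agreement_stats(rater_a: list, rater_b: list) -> dict:
    """Summarise how closely two raters agree on the same ordered set of samples."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("Both raters must score the same non-empty set of samples")
    diffs = [abs(a - b) for a, b in zip(rater_a, rater_b)]
    n = len(diffs)
    return {
        "exact_agreement": sum(d == 0 for d in diffs) / n,  # identical scores
        "within_one": sum(d <= 1 for d in diffs) / n,       # at most one point apart
        "mean_abs_diff": sum(diffs) / n,                    # average gap in points
    }


# Hypothetical scores for five samples on the rubric's 1-5 scale.
human_scores = [5, 4, 3, 2, 4]
judge_scores = [5, 3, 3, 4, 4]

print(agreement_stats(human_scores, judge_scores))
```

Samples where the raters differ by two or more points are good candidates for new anchor examples.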



### 💻 Working Example: Judge the Section 3.2 Code Review

> **Note:** To avoid the AI model grading its own review or automatically preferring its own output, we switch the judge to a different provider/model so the evaluation comes from an independent model.

This cell replays the Section 3.2 template to generate the comprehensive AI review, then immediately scores it with the judge using the same monthly report diff.

**What you'll see:**
- The full AI review that the template produces
- How the rubric weights accuracy, completeness, actionability, and communication
- An Accept/Revise/Reject recommendation tied to the numeric thresholds

<div style="margin-top:16px; padding:16px; background:#fef3c7; border-left:4px solid #f59e0b; border-radius:8px; color:#78350f;">
<strong>⚠️ Heads-up:</strong> <br>
The next cell first replays the Section 3.2 prompt template to regenerate the AI review, then runs the LLM-as-Judge rubric on that fresh output.
</div>



In [None]:
# Example: Judge the Section 3.2 code review output

code_diff = '''
+ import json
+ import time
+ from decimal import Decimal
+
+ CACHE = {}
+
+ def generate_monthly_report(org_id, db, s3_client):
+     if org_id in CACHE:
+         return CACHE[org_id]
+
+     query = f"SELECT * FROM invoices WHERE org_id = '{org_id}' ORDER BY created_at DESC"
+     rows = db.execute(query)
+
+     total = Decimal(0)
+     items = []
+     for row in rows:
+         total += Decimal(row['amount'])
+         items.append({
+             'id': row['id'],
+             'customer': row['customer_name'],
+             'amount': float(row['amount'])
+         })
+
+     payload = {
+         'org': org_id,
+         'generated_at': time.strftime('%Y-%m-%d %H:%M:%S'),
+         'total': float(total),
+         'items': items
+     }
+
+     key = f"reports/{org_id}/{int(time.time())}.json"
+     time.sleep(0.5)
+     s3_client.put_object(
+         Bucket='company-reports',
+         Key=key,
+         Body=json.dumps(payload),
+         ACL='public-read'
+     )
+
+     CACHE[org_id] = key
+     return key
'''

review_messages = [
    {
        "role": "system",
        "content": "You follow structured review templates and produce clear, actionable findings."
    },
    {
        "role": "user",
        "content": f"""
<role>
Act as a Senior Software Engineer specializing in Python backend services.
Your expertise covers security best practices, performance tuning, reliability, and maintainable design.
</role>

<context>
Repository: analytics-platform
Service: Reporting API
Purpose: Add a monthly invoice report exporter that finance can trigger
Change Scope: Review focuses on the generate_monthly_report implementation
Language: python
</context>

<code_diff>
{code_diff}
</code_diff>

<review_guidelines>
Assess the change across multiple dimensions:

1. Security — SQL injection, S3 object exposure, sensitive data handling.
2. Performance — query efficiency, blocking calls, caching behaviour.
3. Error Handling — resilience to empty results, network/storage failures.
4. Code Quality — readability, global state, data conversions.
5. Correctness — totals, currency precision, repeated report generation.
6. Best Practices — configuration management, separation of concerns, testing hooks.
For each finding, cite the diff line, describe impact, and share an actionable fix.
</review_guidelines>

<tasks>
Step 1 - Think: Analyse the diff using the dimensions listed above.
Step 2 - Assess: For each issue, capture Severity (CRITICAL/MAJOR/MINOR/INFO), Category, Line, Issue, Impact.
Step 3 - Suggest: Provide a concrete remediation (code change or process tweak).
Step 4 - Verdict: Summarise overall risk and recommend APPROVE / REQUEST CHANGES / NEEDS WORK.
</tasks>

<output_format>
## Code Review Summary
[One paragraph on overall health and primary risks]

## Findings
### [SEVERITY] Issue Title
**Category:** [Security / Performance / Quality / Correctness / Best Practices]
**Line:** [line number]
**Issue:** [impact-focused description]
**Recommendation:**
```
# safer / faster / cleaner fix here
```

## Overall Assessment
**Recommendation:** [APPROVE | REQUEST CHANGES | NEEDS WORK]
**Summary:** [What to address before merge]
</output_format>
"""
    },
]

print("🔍 Generating the Section 3.2 code review...")
print(f"Using {PROVIDER.upper()} with {get_default_model()}")
print("=" * 70)
ai_generated_review = get_chat_completion(review_messages, temperature=0.0)
print(ai_generated_review)
print("=" * 70)

rubric_prompt = """
<context>
Original pull request diff:
{context}

AI-generated review to evaluate:
{ai_output}
</context>

<rubric>
1. Accuracy (40%): Do the identified issues actually exist, and are they correctly described?
2. Completeness (30%): Are major concerns covered? Any critical issues missed?
3. Actionability (20%): Are recommendations specific and implementable?
4. Communication (10%): Is the review professional, clear, and well-structured?
</rubric>

<instructions>
Score each criterion 1-5 with detailed rationale.
Calculate weighted total: (Accuracy×0.4) + (Completeness×0.3) + (Actionability×0.2) + (Communication×0.1)

Recommend:
- ACCEPT (≥3.5): Production-ready
- REVISE (2.5-3.4): Needs improvements  
- REJECT (<2.5): Unacceptable quality
</instructions>

Provide structured evaluation with scores, weighted total, recommendation, and feedback.
"""

judge_messages = [
    {"role": "system", "content": "You are a Principal Engineer reviewing AI-generated code feedback."},
    {"role": "user", "content": rubric_prompt.format(context=code_diff, ai_output=ai_generated_review)}
]

original_provider = setup_utils.PROVIDER
try:
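    # Switch the judge to a different provider than the one that wrote the review.
    setup_utils.PROVIDER = 'openai'
    print("⚖️ JUDGE EVALUATION IN PROGRESS...")
    # Read PROVIDER from the module: the name bound by `from setup_utils import *`
    # still holds the original value and would report the wrong provider here.
    print(f"Using {setup_utils.PROVIDER.upper()} with {get_default_model()}")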
    print("=" * 70)
    judge_result = get_chat_completion(judge_messages, temperature=0.0)
    print(judge_result)
    print("=" * 70)
finally:
    setup_utils.PROVIDER = original_provider
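
The judge returns free text, so a pipeline needs some way to pull out the verdict. The sketch below is one possible approach, assuming the judge writes the word "Recommendation" shortly before the verdict; that format is not guaranteed by the prompt above, and `extract_recommendation` is a hypothetical helper, not part of `setup_utils`.

```python
import re
from typing import Optional

# Assumes the verdict follows the word "Recommendation" within a few punctuation
# characters, for example "**Recommendation:** ACCEPT".
RECOMMENDATION_PATTERN = re.compile(
    r"recommendation\W{0,10}(ACCEPT|REVISE|REJECT)\b",
    re.IGNORECASE,
)


def extract_recommendation(judge_text: str) -> Optional[str]:
    """Return ACCEPT, REVISE, or REJECT, or None when no verdict can be found."""
    match = RECOMMENDATION_PATTERN.search(judge_text)
    return match.group(1).upper() if match else None


# Made-up judge output for illustration; in the notebook you would pass judge_result.
sample_output = "Weighted total: 3.8\n**Recommendation:** ACCEPT\nFeedback: tighten the SQL fix."
verdict = extract_recommendation(sample_output)
print(verdict or "No verdict found: route to a human reviewer")
```

Returning `None` rather than guessing keeps the fallback explicit: an unparseable evaluation goes to a human instead of being silently accepted.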


## 🏋️ Activity 3.4: Create Your Judge Template

**Now it's your turn!** Complete Activity 3.4 to build your own judge template.

### 📝 Instructions

1. **Open the activity file**: `activities/activity-3.4-llm-as-judge-evaluation.md`
2. **Edit the template**: Replace all `TODO` comments with your scoring criteria
3. **Come back here**: Run the cells below to test your template
4. **Iterate**: Refine your template based on the results

**What you're building**: A judge that evaluates the cache refactor scenario against 4 weighted criteria (correctness, design, safety, tests) and outputs Accept/Revise/Block.

<div style="margin-top:16px; color:#991b1b; padding:12px; background:#fee2e2; border-radius:6px; border-left:4px solid #ef4444;">
<style>
code {
  font-family: Consolas,"courier new";
  color:rgb(238, 13, 13);
  background-color: #f1f1f1;
  padding: 2px;
  font-size: 110%;
}
</style>
<strong>⚠️ COMPLETE THE ACTIVITY FIRST:</strong><br>
Before running the cells below, you must:
<ol style="margin: 8px 0 0 0;">
<li>Open <code>activities/activity-3.4-llm-as-judge-evaluation.md</code></li>
<li>Replace all <code>TODO</code> comments in the template (between <code>&lt;!-- TEMPLATE START --&gt;</code> and <code>&lt;!-- TEMPLATE END --&gt;</code>)</li>
<li>Save the file</li>
<li>Return here to test your template</li>
</ol>
</div>

<div style="margin-top:16px; color:#78350f; padding:12px; background:#fef3c7; border-radius:6px; border-left:4px solid #f59e0b;">
<strong>💡 What the test provides:</strong><br>
The <code>test_activity_3_4()</code> function loads your template and fills in all the scenario details (code_before, code_after, refactor rationale, etc.). You just need to define the scoring criteria!
</div>

### 🔁 Test Your Judge Template

Run the cell below to test your completed template. This loads your template from the activity file and evaluates the cache refactor scenario.

In [None]:
# Test your judge template with the cache refactor scenario
from setup_utils import test_activity_3_4, get_refactor_judge_scenario

print("🧪 Testing your judge template from activity-3.4-llm-as-judge-evaluation.md...")
print("=" * 70)
judge_preview = test_activity_3_4(variables=get_refactor_judge_scenario())
print("\n" + "=" * 70)
print("\n💡 Review the verdict above. Does it match your expectations?")
print("   - If TODOs remain, complete your template in the activity file")
print("   - If scores seem off, adjust your criteria and re-run this cell")
print("   - To see the reference solution, check solutions/activity-3.4-judge-solution.md")

In [None]:
# Optional: Test with custom scenario
# 
# If you want to test your judge with different code, modify the variables below
# and run this cell. Otherwise, the cell above tests with the standard scenario.

from setup_utils import test_activity_3_4

custom_variables = {
    'service_name': 'TODO - Your Service Name',
    'refactor_brief': 'TODO - What was refactored?',
    'code_before': """
# TODO: Paste original code here
""",
    'code_after': """
# TODO: Paste refactored code here
""",
    'refactor_goal': 'TODO - What was the goal?',
    'test_summary': 'TODO - Test results',
    'analysis_findings': 'TODO - Linter/static analysis output',
    'critical_regression': 'TODO - Any known regression?',
    'security_findings': 'TODO - Security scan results',
    'escalation_channel': '#your-channel',
    'ai_refactor_output': """
# TODO: Paste the AI's explanation of the refactor
"""
}

print("🧪 Testing with custom scenario...")
print("⚠️ Make sure to replace all TODO values above before running!")
print("=" * 70)
judge_result = test_activity_3_4(variables=custom_variables)

---
### 🚀 What's Next: From Manual Judging to Systematic Evals

**You just tested your judge on one scenario.** To use this in production, you need **systematic evaluations** that track judge performance over time.

#### Why Evals Matter

Manual testing validates one case. **Evals** validate your judge across hundreds of cases and track metrics:
- **Accuracy**: Does your judge correctly identify good vs bad refactors?
- **False positives**: How often does it block acceptable changes?
- **Consistency**: Does rubric v2 improve on v1?
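
As a rough sketch of the first two metrics, suppose each eval case records the verdict you expected and the verdict the judge gave. The function name, the verdict labels, and the sample pairs below are illustrative assumptions only.

```python
def judge_metrics(pairs: list) -> dict:
    """Compute accuracy and false-positive rate from (expected, judged) verdict pairs.

    A false positive here means the judge returned BLOCK for a change
    whose expected verdict was ACCEPT.
    """
    if not pairs:
        raise ValueError("Need at least one (expected, judged) pair")
    correct = sum(expected == judged for expected, judged in pairs)
    acceptable = [judged for expected, judged in pairs if expected == "ACCEPT"]
    wrongly_blocked = sum(judged == "BLOCK" for judged in acceptable)
    return {
        "accuracy": correct / len(pairs),
        "false_positive_rate": wrongly_blocked / len(acceptable) if acceptable else 0.0,
    }


# Hypothetical results for five refactors with known verdicts.
results = [
    ("ACCEPT", "ACCEPT"),
    ("ACCEPT", "BLOCK"),
    ("BLOCK", "BLOCK"),
    ("REVISE", "REVISE"),
    ("ACCEPT", "REVISE"),
]

print(judge_metrics(results))
```

Tracking both numbers matters because a judge can reach high accuracy while still blocking too many acceptable changes.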

**Learn why evals are critical:** [Why LLM Evals Matter](https://www.youtube.com/watch?v=vygFgCNR7WA&list=PLfaIDFEXuae0um8Fj0V4dHG37fGFU8Q5S)

#### Production Eval Platforms

Scale your judge with evaluation platforms:

- **[OpenAI Platform Evals](https://platform.openai.com/docs/guides/evals)**: Dashboard-based systematic evaluation with datasets and metrics
- **[Anthropic Evaluation Tool](https://docs.anthropic.com/en/docs/test-and-evaluate/eval-tool)**: Console-based prompt testing with side-by-side comparison

#### Quick Start

1. **Build an eval dataset**: Collect 10-20 refactors with known verdicts
2. **Run systematic evals**: Test your judge template against the dataset
3. **Track metrics**: Measure accuracy, iterate on rubric weights
4. **Compare models**: Test whether GPT-4o or Claude performs better as a judge
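
The steps above can be sketched as a small harness that runs any judge over a labelled dataset and reports how often it matches the known verdict. Everything here is a stand-in: `run_eval` is a hypothetical helper, the two judges are stubs in place of real LLM calls, and the dataset entries are invented for illustration.

```python
from typing import Callable


def run_eval(judge_fn: Callable[[str], str], dataset: list) -> float:
    """Return the share of cases where the judge's verdict matches the expected one."""
    if not dataset:
        raise ValueError("Dataset must contain at least one case")
    hits = sum(judge_fn(case["submission"]) == case["expected"] for case in dataset)
    return hits / len(dataset)


# Stub judges standing in for two rubric versions; a real judge would call an LLM.
def judge_v1(submission: str) -> str:
    return "ACCEPT"


def judge_v2(submission: str) -> str:
    return "BLOCK" if "public-read" in submission else "ACCEPT"


# Invented cases; in practice, collect real refactors with known verdicts.
dataset = [
    {"submission": "Uploads the report with ACL='public-read'", "expected": "BLOCK"},
    {"submission": "Renames a local variable for clarity", "expected": "ACCEPT"},
]

for name, judge in [("rubric v1", judge_v1), ("rubric v2", judge_v2)]:
    print(f"{name}: {run_eval(judge, dataset):.0%} agreement with known verdicts")
```

Because the judge is passed in as a function, the same harness can compare rubric versions or different judge models against one dataset.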

---

### 📚 Learn More: Production-Ready Evaluation Patterns

- [LLM as a Judge: Scaling AI Evaluation Strategies](https://youtu.be/trfUBIDeI1Y?si=mxwrME9l3KcpZNPj) - How LLM as a judge can scale and refine evaluations with strategies like direct assessment and pairwise comparison.
- [LLM-as-a-Judge: Rethinking Model-Based Evaluations in Text Generation](https://leehanchung.github.io/blogs/2024/08/11/llm-as-a-judge/) - Analyzes the evolution of text generation evaluation methods, from traditional approaches to LLM-as-a-Judge
- [The challenges in using LLM-as-a-Judge](https://youtu.be/vBJF2sy1Pyw?si=S5IsgIOu0dzASUbH) - Learn how to use LLM-based evaluations effectively, understand associated challenges and look at what lies beyond evaluation.
- [LLM-as-a-judge on Amazon Bedrock Model Evaluation](https://aws.amazon.com/blogs/machine-learning/llm-as-a-judge-on-amazon-bedrock-model-evaluation/) - How to implement and use LLM-as-a-judge capability within Amazon Bedrock Model Evaluation
- [Anthropic: Using the Evaluation Tool](https://docs.claude.com/en/docs/test-and-evaluate/eval-tool) - Evaluate prompts in the developer console.
- [OpenAI Docs: Graders](https://platform.openai.com/docs/guides/graders) - Graders are a way to evaluate your model's performance against reference answers.
- [OpenAI DevDay 2024 | Balancing accuracy, latency, and cost at scale](https://youtu.be/Bx6sUDRMx-8) - LLM optimization
- [Session 1 Recap](../../session_1_introduction_and_basics.ipynb) - Revisit why automated metrics alone miss hallucinations.

## ✅ Section 3.4 Complete!

<div style="margin-top:16px; padding:14px; background:#dcfce7; border-left:4px solid #22c55e; border-radius:6px; color:#065f46;">
<strong>🎉 Outstanding work!</strong> You've completed the advanced LLM-as-Judge section and mastered all of Module 3!
</div>

**Key takeaways:**
- Built weighted rubrics to evaluate AI-generated outputs
- Learned to set automated decision thresholds (Accept/Revise/Block)
- Discovered how to scale from manual testing to systematic evals

### 🎊 Module 3 Complete!

You've now mastered:
- ✅ **Code Review Automation** (Section 3.2)
- ✅ **Test Generation Automation** (Section 3.3)
- ✅ **LLM-as-Judge Evaluation** (Section 3.4)

### ⏭️ Next Steps

**Ready to integrate?** Continue to **Module 4: Integration** to learn how to:
- Integrate your templates into GitHub Copilot, OpenAI Codex, and Claude Code
- Build custom commands and workflows for AI code assistants
- Operationalize prompt engineering across your team

**Want to apply what you learned?**
- Use your judge on Activities 3.2 and 3.3 outputs
- Build eval datasets to track judge performance over time
- Integrate quality gates into your CI/CD pipeline