# Section 3.4: Evaluate Your Prompt Templates

| **Aspect** | **Details** |
|-------------|-------------|
| **Goal** | Add an evaluation layer that scores outputs from your prompt templates before they reach production |
| **Time** | ~40 minutes |
| **Prerequisites** | Sections 3.1‚Äì3.3 complete, `setup_utils.py` loaded |
| **Level** | **Advanced** - Recommended after mastering 3.2 & 3.3 |
| **What You'll Strengthen** | Trustworthy automation, rubric design, quality gates |
| **Next Steps** | Return to the [Module 3 overview](./README.md) or wire scores into your workflow |

---

> **üí° New to this module?** This is an **advanced optional section**. If you haven't completed Sections 3.2 and 3.3, go back and master those first. This section builds on that foundation.

You just built reusable prompt templates in Sections 3.2 and 3.3. Now you'll learn how to **evaluate those AI outputs** with weighted rubrics so you can accept great responses, request revisions, or escalate risky ones.

## Quick Setup Check

Since you completed Section 1, setup is already done! We just need to import it.

In [None]:
# Quick setup check - imports setup_utils
try:
    import importlib
    import setup_utils
    importlib.reload(setup_utils)
    from setup_utils import *
    print(f"‚úÖ Setup loaded! Using {get_provider().upper()} with {get_default_model()}")
    print("üöÄ Ready to score AI outputs with evaluation rubrics!")
except ImportError:
    print("‚ùå Setup not found!")
    print("üí° Please run 3.1-setup-and-introduction.ipynb first to set up your environment.")

## Evaluation Template

### Building the Evaluation Loop for Your Prompt Templates

<div style="background:#fef3c7; border-left:4px solid #f59e0b; padding:16px; border-radius:6px; margin:20px 0; color:#000000;">
<strong style="color:#92400e;">üéØ What You'll Build in This Section</strong><br><br>

You'll create an **evaluation rubric** that reviews the output produced by your prompt templates. The rubric scores the response, explains its verdict, and tells you whether to accept it, request a revision, or fall back to a human reviewer.
<br><br>
<strong>Time Required:</strong> ~40 minutes (learning + examples + activity)
</div>

Layering an evaluation rubric after your templates keeps quality high without sending everything back to humans. In Module 2 we learned that traditional metrics (F1, BLEU, ROUGE) miss hallucinations and manual reviews are too slow to scale. A rubric-driven evaluation gives you semantic understanding *and* consistent scoring.

---

#### ü§î Quick Reflection: Your Quality Assurance Experience

Before we dive in, take a moment to reflect on your own experience:

<div style="background:#e0f2fe; border-left:4px solid #0284c7; padding:16px; border-radius:6px; margin:16px 0; color:#000000;">
<strong style="color:#0c4a6e;">üí≠ Think about the last time you reviewed AI-generated content or code:</strong><br><br>

**Question 1:** How did you decide if the output was "good enough"? Gut feeling? Checklist?<br>
**Question 2:** Did different reviewers accept/reject the same output differently?<br>
**Question 3:** Could you articulate why you approved or rejected it to someone else?
</div>

---

#### The Problems We're Solving Together

Sound familiar? Let's connect these challenges to real scenarios you've probably experienced:

<div style="background:#fff; border:2px solid #e5e7eb; padding:16px; border-radius:8px; margin:16px 0; color:#000000;">

**1. üö® Silent Failures**

<div style="margin-left:16px; padding:12px; background:#fef3c7; border-radius:6px; margin-top:8px; margin-bottom:16px;">
<strong>üîç Spot This Pattern?</strong><br><br>
‚Ä¢ Your AI code review template flags 5 security issues<br><br>
‚Ä¢ You merge it, deploy to production<br><br>
‚Ä¢ Later: Customer reports a security vulnerability the AI review mentioned but... got the line numbers wrong<br><br>
‚Ä¢ Result: The review looked comprehensive but had factual errors no one caught<br><br>
<strong>Result:</strong> Polished-looking output with hidden mistakes that traditional metrics can't detect.
</div>

**Real impact:**
- AI-generated code reviews that miss critical bugs
- Documentation that looks complete but has incorrect examples
- Test specifications that skip edge cases
- **Your risk:** How do you know if AI output is production-ready?

---

**2. ‚è≥ Manual QA Bottlenecks**

<div style="margin-left:16px; padding:12px; background:#fef3c7; border-radius:6px; margin-top:8px; margin-bottom:16px;">
<strong>üîç Sound Familiar?</strong><br><br>
‚Ä¢ You generate 50 AI code reviews per day<br><br>
‚Ä¢ Senior engineer spot-checks 5 of them (10% sample)<br><br>
‚Ä¢ Pipeline is blocked waiting for manual validation<br><br>
‚Ä¢ Meanwhile: 45 reviews ship without verification<br><br>
<strong>Result:</strong> Either bottleneck the pipeline or accept unvalidated outputs.
</div>

**The scaling problem:**
- Human review doesn't scale to hundreds of AI outputs daily
- Spot checks miss systemic issues
- Feedback arrives too late for CI/CD pipelines
- **Ask yourself:** Can you manually verify every AI-generated output?

---

**3. üéØ Inconsistent Standards**

<div style="margin-left:16px; padding:12px; background:#fef3c7; border-radius:6px; margin-top:8px; margin-bottom:16px;">
<strong>üîç Ever Been Here?</strong><br><br>
‚Ä¢ Engineer A accepts AI code review if it mentions security<br><br>
‚Ä¢ Engineer B wants specific line numbers and fix recommendations<br><br>
‚Ä¢ Engineer C rejects anything without performance analysis<br><br>
‚Ä¢ New hire: "What's our acceptance criteria?"<br><br>
‚Ä¢ Team lead: *crickets*<br><br>
<strong>Result:</strong> Every reviewer applies different standards, inconsistent quality.
</div>

**The consistency problem:**
- No codified criteria for "good enough"
- Different reviewers = different thresholds
- Teams struggle to know when to ship vs regenerate
- **Challenge:** Try defining your acceptance criteria right now. How specific can you be?

</div>

---

#### Here's What You'll Build Today

By the end of this section, you'll have an evaluation template that:

‚úÖ **Catches hidden errors** - Semantic evaluation detects factually wrong but well-formatted outputs<br>
‚úÖ **Scales automatically** - Review hundreds of AI outputs without human bottlenecks<br>
‚úÖ **Applies consistent criteria** - Same rubric, same thresholds, every evaluation<br>
‚úÖ **Provides actionable verdicts** - Accept/Revise/Reject decisions your pipeline can automate<br>
‚úÖ **Documents reasoning** - Auditable scores with rationale for every decision

**Ready to build it?** Let's turn those quality gaps into systematic evaluation. ‚¨áÔ∏è

### üíª Working Example: Judge the Section 3.2 Code Review

> **Note:** To avoid the AI model grading its own review or automatically preferring its own output, we switch the judge to a different provider/model so the evaluation comes from an independent model.

This cell replays the Section 3.2 template to generate the comprehensive AI review, then immediately scores it with the judge using the same monthly report diff.

**What you'll see:**
- The full AI review that the template produces
- How the rubric weights accuracy, completeness, actionability, and communication
- An Accept/Revise/Reject recommendation tied to the numeric thresholds

<div style="margin-top:16px; padding:16px; background:#fef3c7; border-left:4px solid #f59e0b; border-radius:8px; color:#78350f;">
<strong>‚ö†Ô∏è Heads-up:</strong> <br>
The next cell first replays the Section 3.2 prompt template to regenerate the AI review, then runs the evaluation rubric on that fresh output.
</div>

<div style="margin-top:16px; color:#991b1b; padding:12px; background:#fee2e2; border-radius:6px; border-left:4px solid #ef4444;">
<strong>‚ö†Ô∏è IMPORTANT:</strong><br>
To avoid the AI model grading its own review or automatically preferring its own output, we switch the judge to a different provider/model so the evaluation comes from an independent model.
</div>

In [None]:
# Example: Judge the Section 3.2 code review output

code_diff = '''
+ import json
+ import time
+ from decimal import Decimal
+
+ CACHE = {}
+
+ def generate_monthly_report(org_id, db, s3_client):
+     if org_id in CACHE:
+         return CACHE[org_id]
+
+     query = f"SELECT * FROM invoices WHERE org_id = '{org_id}' ORDER BY created_at DESC"
+     rows = db.execute(query)
+
+     total = Decimal(0)
+     items = []
+     for row in rows:
+         total += Decimal(row['amount'])
+         items.append({
+             'id': row['id'],
+             'customer': row['customer_name'],
+             'amount': float(row['amount'])
+         })
+
+     payload = {
+         'org': org_id,
+         'generated_at': time.strftime('%Y-%m-%d %H:%M:%S'),
+         'total': float(total),
+         'items': items
+     }
+
+     key = f"reports/{org_id}/{int(time.time())}.json"
+     time.sleep(0.5)
+     s3_client.put_object(
+         Bucket='company-reports',
+         Key=key,
+         Body=json.dumps(payload),
+         ACL='public-read'
+     )
+
+     CACHE[org_id] = key
+     return key
'''

review_messages = [
    {
        "role": "system",
        "content": "You follow structured review templates and produce clear, actionable findings."
    },
    {
        "role": "user",
        "content": f"""
<role>
Act as a Senior Software Engineer specializing in Python backend services.
Your expertise covers security best practices, performance tuning, reliability, and maintainable design.
</role>

<context>
Repository: analytics-platform
Service: Reporting API
Purpose: Add a monthly invoice report exporter that finance can trigger
Change Scope: Review focuses on the generate_monthly_report implementation
Language: python
</context>

<code_diff>
{code_diff}
</code_diff>

<review_guidelines>
Assess the change across multiple dimensions:

1. Security ‚Äî SQL injection, S3 object exposure, sensitive data handling.
2. Performance ‚Äî query efficiency, blocking calls, caching behaviour.
3. Error Handling ‚Äî resilience to empty results, network/storage failures.
4. Code Quality ‚Äî readability, global state, data conversions.
5. Correctness ‚Äî totals, currency precision, repeated report generation.
6. Best Practices ‚Äî configuration management, separation of concerns, testing hooks.
For each finding, cite the diff line, describe impact, and share an actionable fix.
</review_guidelines>

<tasks>
Step 1 - Think: Analyse the diff using the dimensions listed above.
Step 2 - Assess: For each issue, capture Severity (CRITICAL/MAJOR/MINOR/INFO), Category, Line, Issue, Impact.
Step 3 - Suggest: Provide a concrete remediation (code change or process tweak).
Step 4 - Verdict: Summarise overall risk and recommend APPROVE / REQUEST CHANGES / NEEDS WORK.
</tasks>

<output_format>
## Code Review Summary
Write one paragraph on overall health and primary risks

## Findings
For each finding, use this structure:

### {{SEVERITY}} Issue Title
**Category:** Security / Performance / Quality / Correctness / Best Practices
**Line:** Cite the line number
**Issue:** Describe the impact in clear terms
**Recommendation:**
```
# Provide safer / faster / cleaner fix here
```

## Overall Assessment
**Recommendation:** APPROVE | REQUEST CHANGES | NEEDS WORK
**Summary:** Explain what to address before merge
</output_format>
"""
    },
]

print("üîç Generating the Section 3.2 code review...")
print(f"Using {get_provider().upper()} with {get_default_model()}")
print("=" * 70)
ai_generated_review = get_chat_completion(review_messages, temperature=0.0)
print(ai_generated_review)
print("=" * 70)

rubric_prompt = """
<context>
Original pull request diff:
{context}

AI-generated review to evaluate:
{ai_output}
</context>

<rubric>
1. Accuracy (40%): Do identified issues actually exist and are correctly described?
2. Completeness (30%): Are major concerns covered? Any critical issues missed?
3. Actionability (20%): Are recommendations specific and implementable?
4. Communication (10%): Is the review professional, clear, and well-structured?
</rubric>

<instructions>
Score each criterion 1-5 with detailed rationale.
Calculate weighted total: (Accuracy√ó0.4) + (Completeness√ó0.3) + (Actionability√ó0.2) + (Communication√ó0.1)

Recommend:
- ACCEPT (‚â•3.5): Production-ready
- REVISE (2.5-3.4): Needs improvements  
- REJECT (<2.5): Unacceptable quality
</instructions>

Provide structured evaluation with scores, weighted total, recommendation, and feedback.
"""

judge_messages = [
    {"role": "system", "content": "You are a Principal Engineer reviewing AI-generated code feedback."},
    {"role": "user", "content": rubric_prompt.format(context=code_diff, ai_output=ai_generated_review)}
]

original_provider = setup_utils.get_provider()
try:
    setup_utils.set_provider('openai')
    print("‚öñÔ∏è JUDGE EVALUATION IN PROGRESS...")
    print(f"Using {get_provider().upper()} with {get_default_model()}")
    print("=" * 70)
    judge_result = get_chat_completion(judge_messages, temperature=0.0)
    print(judge_result)
    print("=" * 70)
finally:
    setup_utils.set_provider(original_provider)

---

### üèóÔ∏è Understanding What You Just Saw: The Tactical Combination

Now that you've seen the judge in action, let's understand how combining tactics from Module 2 creates a reliable evaluation system.

---

#### Why Add a Judge After Prompt Templates?

Before diving into tactics, let's understand the value:

- **Detect hidden errors:** LLM judges evaluate meaning, not just surface patterns. Paraphrased but wrong answers score poorly even when traditional metrics look fine.
- **Scale automatically:** A second AI call verifies template outputs meet criteria every time‚Äîno human bottleneck for hundreds of daily reviews.
- **Accelerate iteration:** Scores highlight which tactic block needs improvement, letting you A/B test prompts without waiting for manual QA.

---

#### The 6-Tactic Recipe (With the "Why" Behind Each)

Here's how we combine tactics strategically to solve specific failure modes:

<div style="overflow-x:auto;">
<table style="width:100%; border-collapse:collapse; margin:16px 0; background:#fff; border:2px solid #e5e7eb; color:#000000; table-layout:fixed;">
<tr style="background:#f8fafc; font-weight:bold; color:#000000;">
<td style="padding:8px; border:1px solid #e5e7eb; color:#000000; width:25%; word-wrap:break-word;">Tactic</td>
<td style="padding:8px; border:1px solid #e5e7eb; color:#000000; width:35%; word-wrap:break-word;">What It Fixes</td>
<td style="padding:8px; border:1px solid #e5e7eb; color:#000000; width:40%; word-wrap:break-word;">Why LLMs Need This</td>
</tr>
<tr>
<td style="padding:8px; border:1px solid #e5e7eb; color:#000000; word-wrap:break-word;"><strong>üé≠ Role Prompting</strong></td>
<td style="padding:8px; border:1px solid #e5e7eb; color:#000000; word-wrap:break-word;">Generic "good/bad" judgments</td>
<td style="padding:8px; border:1px solid #e5e7eb; color:#000000; word-wrap:break-word;">Positions as Principal Engineer ‚Üí Expert evaluation</td>
</tr>
<tr>
<td style="padding:8px; border:1px solid #e5e7eb; color:#000000; word-wrap:break-word;"><strong>üì¶ Structured Inputs (XML)</strong></td>
<td style="padding:8px; border:1px solid #e5e7eb; color:#000000; word-wrap:break-word;">Judge mixes submission with criteria</td>
<td style="padding:8px; border:1px solid #e5e7eb; color:#000000; word-wrap:break-word;">Clear boundaries ‚Üí Model knows what vs how</td>
</tr>
<tr>
<td style="padding:8px; border:1px solid #e5e7eb; color:#000000; word-wrap:break-word;"><strong>üî¢ Rubric Decomposition</strong></td>
<td style="padding:8px; border:1px solid #e5e7eb; color:#000000; word-wrap:break-word;">Inconsistent scoring across runs</td>
<td style="padding:8px; border:1px solid #e5e7eb; color:#000000; word-wrap:break-word;">Weighted criteria ‚Üí Systematic evaluation</td>
</tr>
<tr>
<td style="padding:8px; border:1px solid #e5e7eb; color:#000000; word-wrap:break-word;"><strong>üß† Chain-of-Thought</strong></td>
<td style="padding:8px; border:1px solid #e5e7eb; color:#000000; word-wrap:break-word;">"3/5" scores without rationale</td>
<td style="padding:8px; border:1px solid #e5e7eb; color:#000000; word-wrap:break-word;">Evidence-based reasoning ‚Üí Auditable</td>
</tr>
<tr>
<td style="padding:8px; border:1px solid #e5e7eb; color:#000000; word-wrap:break-word;"><strong>üìä Decision Thresholds</strong></td>
<td style="padding:8px; border:1px solid #e5e7eb; color:#000000; word-wrap:break-word;">No clear automation hook</td>
<td style="padding:8px; border:1px solid #e5e7eb; color:#000000; word-wrap:break-word;">Numeric gates ‚Üí Accept/Revise/Reject</td>
</tr>
</table>
</div>

**üí° Key Insight:** Each tactic removes one type of evaluation failure. Combine them, and you get reliable, scalable quality gates.

---

#### See the Difference: With vs Without Tactics

<div style="overflow-x:auto;">
<table style="width:100%; border-collapse: collapse; margin:16px 0; background:#fff; border:2px solid #e5e7eb; color:#000000; table-layout:fixed;">
<tr style="background:#f8fafc; font-weight:bold; color:#000000;">
<td style="padding:8px; border:1px solid #e5e7eb; color:#000000; width:20%; word-wrap:break-word;">Scenario</td>
<td style="padding:8px; border:1px solid #e5e7eb; color:#000000; width:40%; word-wrap:break-word;">‚ùå Without Tactics</td>
<td style="padding:8px; border:1px solid #e5e7eb; color:#000000; width:40%; word-wrap:break-word;">‚úÖ With Tactics</td>
</tr>
<tr>
<td style="padding:8px; border:1px solid #e5e7eb; color:#000000; word-wrap:break-word;"><strong>Code Review Evaluation</strong></td>
<td style="padding:8px; border:1px solid #e5e7eb; background:#fef2f2; color:#000000; word-wrap:break-word;">
"This review looks comprehensive. 7/10."
<br><span style="color:#991b1b;">‚Üí No rationale, unclear why 7/10</span>
</td>
<td style="padding:8px; border:1px solid #e5e7eb; background:#ecfdf5; color:#000000; word-wrap:break-word;">
<strong>Accuracy: 4/5</strong> (All issues exist)<br>
<strong>Completeness: 3/5</strong> (Missed performance)<br>
<strong>Weighted: 3.4/5 ‚Üí REVISE</strong><br>
<strong>Feedback:</strong> Add performance section
<br><span style="color:#166534;">‚Üí Specific, actionable, auditable</span>
</td>
</tr>
<tr>
<td style="padding:8px; border:1px solid #e5e7eb; color:#000000; word-wrap:break-word;"><strong>Inconsistent Scores</strong></td>
<td style="padding:8px; border:1px solid #e5e7eb; background:#fef2f2; color:#000000; word-wrap:break-word;">
Run 1: 8/10 "Good work"<br>
Run 2: 6/10 "Needs work"<br>
Run 3: 7/10 "Acceptable"
<br><span style="color:#991b1b;">‚Üí Same input, different scores</span>
</td>
<td style="padding:8px; border:1px solid #e5e7eb; background:#ecfdf5; color:#000000; word-wrap:break-word;">
Every run evaluates:<br>
Accuracy (40%) ‚Üí Completeness (30%) ‚Üí Actionability (20%) ‚Üí Communication (10%)
<br><span style="color:#166534;">‚Üí Consistent criteria every time</span>
</td>
</tr>
<tr>
<td style="padding:8px; border:1px solid #e5e7eb; color:#000000; word-wrap:break-word;"><strong>Pipeline Automation</strong></td>
<td style="padding:8px; border:1px solid #e5e7eb; background:#fef2f2; color:#000000; word-wrap:break-word;">
"This could be better but it's okay"
<br><br>
<span style="color:#991b1b;">‚Üí Can't automate ambiguous output</span>
</td>
<td style="padding:8px; border:1px solid #e5e7eb; background:#ecfdf5; color:#000000; word-wrap:break-word;">
<strong>ACCEPT</strong> (3.6/5 ‚â• 3.5 threshold)
<br><br>
<span style="color:#166534;">‚Üí Clear automation hook</span>
</td>
</tr>
</table>
</div>

---

#### Why Weighted Rubrics?

A single "Is this good?" question lets hidden errors slip through. **Weighted rubrics** give you:

- **Multi-dimensional evaluation:** Accuracy, completeness, actionability, communication
- **Prioritization:** Weight critical criteria higher (e.g., Accuracy 40%, Communication 10%)
- **Measurable thresholds:** Clear numeric gates for automation decisions
- **Auditable feedback:** Every score includes evidence and rationale

<div style="margin-top:12px; padding:12px; background:#fef3c7; border-radius:6px; color:#78350f;">
<strong>üß™ In This Tutorial:</strong> We use 4-criterion weighted rubrics (accuracy, completeness, actionability, communication). Feel free to adjust weights based on your domain‚Äîsecurity-critical systems might weight accuracy 50%.
</div>

---

### Breaking Down the Judge Template: A Walkthrough

Let's dissect the judge template you saw in action. We'll walk through each block and see how it uses Module 2 tactics:

---

#### Block 1: üé≠ Set the Judge Persona

**What it looked like:**
```xml
<role>
You are a Principal Engineer reviewing AI-generated code feedback.
</role>
```

**What this does:** Activates expert evaluation standards. Instead of generic "looks good/bad" judgments, you get analysis a Principal Engineer would apply‚Äîunderstanding what makes feedback production-ready vs. needing revision.

**Module 2 Tactic:** Role Prompting

---

#### Block 2: üî¢ Define Weighted Rubric (What to Evaluate)

**What it looked like:**
```xml
<rubric>
1. Accuracy (40%): Do identified issues actually exist and are correctly described?
2. Completeness (30%): Are major concerns covered? Any critical issues missed?
3. Actionability (20%): Are recommendations specific and implementable?
4. Communication (10%): Is the review professional, clear, and well-structured?
</rubric>
```

**What this does:** Creates a systematic multi-dimensional checklist with explicit priorities. Accuracy gets highest weight (40%) because factually wrong reviews are worse than poorly formatted ones. Every evaluation checks ALL 4 dimensions‚Äîno skipped criteria.

**Module 2 Tactic:** Task Decomposition + Weighted Criteria

---

#### Block 3: üß† Guide the Evaluation Process (How to Score)

**What it looked like:**
```xml
<instructions>
Score each criterion 1-5 with detailed rationale:
- 5: Excellent - Exceeds expectations
- 4: Good - Meets expectations with minor gaps
- 3: Acceptable - Meets minimum bar
- 2: Needs work - Significant gaps
- 1: Unacceptable - Fails to meet standards

Calculate weighted total: (Accuracy√ó0.4) + (Completeness√ó0.3) + (Actionability√ó0.2) + (Communication√ó0.1)

Recommend:
- ACCEPT (‚â•3.5): Production-ready
- REVISE (2.5-3.4): Needs improvements, provide specific guidance
- REJECT (<2.5): Start over with different approach
</instructions>
```

**What this does:** Forces the judge to show its reasoning. You don't get "3/5" scores without explanation‚Äîyou get evidence-based rationale tied to the explicit scale. Weighted calculation and thresholds make decisions consistent and automatable.

**Module 2 Tactic:** Chain-of-Thought + Decision Thresholds

---

#### Block 4: üì¶ Separate Submission from Criteria

**What it looked like:**
```xml
<submission>
{{llm_output_under_review}}
</submission>
```

**What this does:** Clear XML boundaries separate "what to evaluate" from "how to evaluate it." The judge knows the submission content is what needs scoring, not the rubric criteria themselves.

**Module 2 Tactic:** Structured Inputs (XML)

---

#### Block 5: üìä Specify Output Format (How to Report)

**What it looked like:**
```xml
<output_format>
Provide structured evaluation with:
- Individual scores (1-5) with rationale for each criterion
- Weighted total score
- Recommendation (ACCEPT/REVISE/REJECT)
- Specific feedback for improvements
</output_format>
```

**What this does:** Standardizes output for automation. Your pipeline can parse the ACCEPT/REVISE/REJECT decision, extract numeric scores for tracking, and surface improvement feedback. No more free-form text that's hard to act on.

**Module 2 Tactic:** Structured Output

---

#### üîÑ Making It Reusable

**Add variables** for the parts that change between use cases:

```xml
<role>
You are a {{judge_role}} reviewing {{content_type}}.
</role>

<rubric>
1. {{criterion_1_name}} ({{weight_1}}%): {{criterion_1_description}}
2. {{criterion_2_name}} ({{weight_2}}%): {{criterion_2_description}}
...
</rubric>
```

Now you can use the same judge template across different domains:
- Code reviews: `judge_role="Principal Engineer"`, `content_type="AI-generated code feedback"`
- Documentation: `judge_role="Technical Writer"`, `content_type="API documentation"`
- Test specs: `judge_role="QA Lead"`, `content_type="test specifications"`

**One judge template, infinite use cases.** Just adjust the role, criteria, and weights to match your domain.

---

#### Design Principles for Rubrics

**1. Weighted Criteria** ‚Äì Prioritize what matters most (e.g., accuracy first for safety-critical domains).

**2. Explicit Scale** ‚Äì Clear 1-5 definitions stop the judge from drifting between runs.

**3. Evidence-Based Rationale** ‚Äì Forces the model to ground scores in the submission content.

**4. Actionable Thresholds** ‚Äì Numeric gates (3.5, 2.5) enable pipeline automation.

**5. Improvement Guidance** ‚Äì "Revise" outcomes must include next steps for the generator.

---

#### Calibration: Keeping Scores Consistent

The rubric defines **what** to score; calibration ensures **how** it's scored stays consistent. Instead of generic "7/10 - pretty good" language, define anchors:

**Example:** 7/10 = factually accurate with minor gaps, clear structure, appropriate for target audience, missing 1-2 implementation details.

<div style="overflow-x:auto;">
<table style="width:100%; border-collapse:collapse; margin:16px 0; background:#fff; border:2px solid #e5e7eb; color:#000000; table-layout:fixed;">
<tr style="background:#f8fafc; font-weight:bold; color:#000000;">
<td style="padding:8px; border:1px solid #e5e7eb; color:#000000; width:25%; word-wrap:break-word;">Scenario</td>
<td style="padding:8px; border:1px solid #e5e7eb; color:#000000; width:25%; word-wrap:break-word;">9/10 (Excellent)</td>
<td style="padding:8px; border:1px solid #e5e7eb; color:#000000; width:25%; word-wrap:break-word;">5/10 (Acceptable)</td>
<td style="padding:8px; border:1px solid #e5e7eb; color:#000000; width:25%; word-wrap:break-word;">2/10 (Needs Work)</td>
</tr>
<tr>
<td style="padding:8px; border:1px solid #e5e7eb; color:#000000; word-wrap:break-word;">Technical documentation</td>
<td style="padding:8px; border:1px solid #e5e7eb; color:#000000; word-wrap:break-word;">Complete, tested, handles edge cases</td>
<td style="padding:8px; border:1px solid #e5e7eb; color:#000000; word-wrap:break-word;">Covers main flows, some gaps</td>
<td style="padding:8px; border:1px solid #e5e7eb; color:#000000; word-wrap:break-word;">Basic concepts only, missing details</td>
</tr>
</table>
</div>

**Best Practices:**
- **Anchor scores** with real examples at each level (1, 3, 5)
- **Recalibrate quarterly** with domain experts as standards evolve
- **Check inter-rater reliability** to ensure consistent interpretation

---

## Activity 3.4: Create Your Judge Template

Now that you've seen how the template works, try building one from scratch for a cache refactor scenario.

**Your task:** Create an evaluation template that evaluates AI-generated refactor explanations. Open **[`activities/activity-3.4-evaluation-templates.md`](./activities/activity-3.4-evaluation-templates.md)** and complete the template between the `<!-- TEMPLATE START -->` and `<!-- TEMPLATE END -->` markers.

The template should:
- Set an appropriate judge role (e.g., Senior Engineer reviewing refactor proposals)
- Define weighted rubric criteria (Correctness, Design, Safety, Tests)
- Include explicit scoring scale (1-5) with decision thresholds
- Specify structured output format with verdict and improvement feedback

**The challenge:** The refactor scenario includes subtle issues. Your judge should catch factual inaccuracies, missing test coverage, and design trade-offs.

When you're done, come back and run the cell below to test it. Compare your result with the **[solution](./solutions/activity-3.4-judge-solution.md)** afterward.

### Test Your Judge Template

Run the cell below to test your completed template. This loads your template from the activity file and evaluates the cache refactor scenario.

In [None]:
# Test your judge template with the cache refactor scenario
from setup_utils import test_activity_3_4, get_refactor_judge_scenario

print("üß™ Testing your evaluation template from activity-3.4-evaluation-templates.md...")
print("=" * 70)
judge_preview = test_activity_3_4(variables=get_refactor_judge_scenario())
print("\n" + "=" * 70)
print("\nüí° Review the verdict above. Does it match your expectations?")
print("   - If TODOs remain, complete your template in the activity file")
print("   - If scores seem off, adjust your criteria and re-run this cell")
print("   - To see the reference solution, check solutions/activity-3.4-judge-solution.md")

In [None]:
# Optional: Test with custom scenario
# 
# If you want to test your judge with different code, modify the variables below
# and run this cell. Otherwise, the cell above tests with the standard scenario.

from setup_utils import test_activity_3_4

custom_variables = {
    'service_name': 'TODO - Your Service Name',
    'refactor_brief': 'TODO - What was refactored?',
    'code_before': """
# TODO: Paste original code here
""",
    'code_after': """
# TODO: Paste refactored code here
""",
    'refactor_goal': 'TODO - What was the goal?',
    'test_summary': 'TODO - Test results',
    'analysis_findings': 'TODO - Linter/static analysis output',
    'critical_regression': 'TODO - Any known regression?',
    'security_findings': 'TODO - Security scan results',
    'escalation_channel': '#your-channel',
    'ai_refactor_output': """
# TODO: Paste the AI's explanation of the refactor
"""
}

print("üß™ Testing with custom scenario...")
print("‚ö†Ô∏è Make sure to replace all TODO values above before running!")
print("=" * 70)
judge_result = test_activity_3_4(variables=custom_variables)

---

### Evaluate Your Judge Template

<div style="background:#f0f9ff; border-left:4px solid #0ea5e9; padding:16px; border-radius:6px; margin:20px 0; color:#000000;">
<strong style="color:#0c4a6e;">üí° Want feedback on your judge template?</strong><br><br>

Use <code style="color:#dc2626; background-color:#f1f1f1; padding:2px; font-family:Consolas,'courier new';">evaluate_prompt()</code> to get comprehensive automated feedback (same evaluation system from Section 3.2):

- **Traditional Metrics (40%)**: Pattern detection
- **AI Evaluation (40%)**: Quality scores with confidence levels
- **Semantic Similarity (20%)**: Comparison with reference solution

<strong>üìö For details on evaluation and Confidence Scores, see Section 3.2.</strong>
</div>

**Run the cell below to evaluate your judge template!** ‚¨áÔ∏è

In [None]:
# Optional: Evaluate your Activity 3.4 judge template
from setup_utils import evaluate_prompt, extract_template_from_activity, get_refactor_judge_scenario

template, error = extract_template_from_activity('activities/activity-3.4-evaluation-templates.md')

if error:
    print(error)
else:
    # Define the same variables used in cell 12 for substitution
    # These match what test_activity_3_4() uses to fill the template placeholders
    variables = get_refactor_judge_scenario()
    
    # Substitute variables in template (same logic as test_activity uses internally)
    print("üîÑ Substituting template variables...")
    substituted_template = template
    for key, value in variables.items():
        placeholder = "{{" + key + "}}"
        substituted_template = substituted_template.replace(placeholder, str(value))
    
    print("üìñ Evaluating your Activity 3.4 judge template...")
    print("‚è≥ This will take ~30 seconds\n")
    
    evaluate_prompt(
        messages=substituted_template,  # ‚úÖ Now fully substituted with actual content
        activity_name="Activity 3.4: Evaluation Template",
        expected_tactics=[
            "Role Prompting",
            "Structured Inputs",
            "Output Format Specification",
            "Chain-of-Thought",
            "Evaluation Rubric",
            "Weighted Criteria"
        ],
        activity_file='activities/activity-3.4-evaluation-templates.md',
        compare_with_reference=True,
        track_progress=True
    )
    
    print("\n‚úÖ Evaluation complete!")
    print("Next: Run view_progress() below to see your improvement!")

---

## Track Your Progress

After completing the evaluation above, run the cell below to see your learning journey:

- üìä All your evaluation attempts for this section  
- üìà Your improvement over time
- üèÜ Achievement status (scores ‚â• 80 earn **SKILLS ACQUIRED** badge!)

---

In [None]:
# üìä VIEW YOUR PROGRESS
# Run this cell anytime to see your evaluation history and improvement

from setup_utils import view_progress

print("=" * 70)
print("üìä YOUR SECTION 3.4 PROGRESS")
print("=" * 70)
print()

view_progress("Activity 3.4: Evaluation Template")

print()
print("=" * 70)
print("üí° TIP: Scored ‚â• 80? You've mastered evaluation templates!")
print("=" * 70)

---
### What's Next: From Manual Judging to Systematic Evals

**You just tested your judge on one scenario.** To use this in production, you need **systematic evaluations** that track judge performance over time.

#### Why Evals Matter

Manual testing validates one case. **Evals** validate your judge across hundreds of cases and track metrics:
- **Accuracy**: Does your judge correctly identify good vs bad refactors?
- **False positives**: How often does it block acceptable changes?
- **Consistency**: Does rubric v2 improve on v1?

**Learn why evals are critical:** [Why LLM Evals Matter](https://www.youtube.com/watch?v=vygFgCNR7WA&list=PLfaIDFEXuae0um8Fj0V4dHG37fGFU8Q5S)

#### Production Eval Platforms

Scale your judge with evaluation platforms:

- **[OpenAI Platform Evals](https://platform.openai.com/docs/guides/evals)**: Dashboard-based systematic evaluation with datasets and metrics
- **[Anthropic Evaluation Tool](https://docs.anthropic.com/en/docs/test-and-evaluate/eval-tool)**: Console-based prompt testing with side-by-side comparison

#### Quick Start

1. **Build an eval dataset**: Collect 10-20 refactors with known verdicts
2. **Run systematic evals**: Test your judge template against the dataset
3. **Track metrics**: Measure accuracy, iterate on rubric weights
4. **Compare models**: Test if GPT-4o vs Claude performs better as judge

---

---

<div style="padding:16px; background:linear-gradient(135deg, #667eea 0%, #764ba2 100%); border-radius:10px; color:#fff; text-align:center; box-shadow:0 4px 15px rgba(102,126,234,0.3);">
  <strong style="font-size:1.05em;">üéâ Excellent work! You've completed the advanced evaluation section.</strong><br>
  <span style="font-size:0.92em; opacity:0.95; margin-top:4px; display:block;">Take a moment to reflect on what you've learned before moving forward.</span>
</div>

---

<div style="padding:20px; background:linear-gradient(135deg, #a8edea 0%, #fed6e3 100%); border-radius:10px; text-align:center; box-shadow:0 4px 15px rgba(168,237,234,0.3); margin-top:24px;">
  <div style="font-size:2em; margin-bottom:10px;">üéä</div>
  <div style="font-size:1.3em; font-weight:700; color:#2d3748; margin-bottom:8px;">Congratulations on Completing Module 3!</div>
  <div style="font-size:0.95em; color:#2d3748; opacity:0.9;">You're now equipped to build, evaluate, and deploy production-ready AI automation workflows.</div>
</div>


### What You Built
You've mastered all core sections of Module 3, learning to build production-ready prompt templates and quality gates for AI-powered development workflows.

**Section 3.2:** Code Review Automation ‚Äî Comprehensive review templates with severity classification  
**Section 3.3:** Test Generation Automation ‚Äî Requirements-to-tests with ambiguity detection  
**Section 3.4:** Evaluation Templates ‚Äî Weighted rubrics for automated quality gates

### Key Skills Acquired

<div style="background:#fff; border:2px solid #e5e7eb; padding:16px; border-radius:8px; margin:16px 0; color:#000000;">

**üéØ Template Design**
- ‚úÖ Multi-tactic stacking (role + structure + reasoning + output)
- ‚úÖ Variable substitution for reusable templates
- ‚úÖ Command-style organization for automation

**‚öñÔ∏è Evaluation Systems**
- ‚úÖ Weighted rubric design (accuracy, completeness, actionability)
- ‚úÖ Decision thresholds (Accept/Revise/Reject)
- ‚úÖ Evidence-based reasoning with confidence scores

**üîÑ Production Workflows**
- ‚úÖ Multi-dimensional code review (security, performance, quality)
- ‚úÖ Systematic test specification (ambiguities ‚Üí coverage ‚Üí specs)
- ‚úÖ Automated quality gates with LLM judges

</div>

<div style="padding:12px; background:#dbeafe; border-radius:6px; border-left:4px solid #3b82f6; color:#1e40af; margin-top:16px; margin-bottom:16px;">
<strong>üìù Skills Demonstrated</strong><br><br>
If you scored ‚â• 80 on the activities, you've demonstrated the ability to:
<ul style="margin:8px 0 0 0;">
<li>Design production-ready prompt templates from scratch</li>
<li>Combine multiple tactics into reliable automation workflows</li>
<li>Build evaluation rubrics that scale to hundreds of outputs</li>
<li>Implement quality gates with clear decision thresholds</li>
</ul>
</div>

---

### Additional Resources

<div style="background:#fff; border:2px solid #e5e7eb; padding:16px; border-radius:8px; margin:16px 0; color:#000000;">

**Evaluation Platforms:**
- **[OpenAI Platform Evals](https://platform.openai.com/docs/guides/evals)** ‚Äî Dashboard-based systematic evaluation with datasets
- **[Anthropic Evaluation Tool](https://docs.anthropic.com/en/docs/test-and-evaluate/eval-tool)** ‚Äî Console-based prompt testing

**Production Patterns:**
- **[AWS Anthropic Patterns](https://github.com/aws-samples/anthropic-on-aws/tree/main/advanced-claude-code-patterns)** ‚Äî Production command patterns
- **[OpenAI GPT-5 Guide](https://cookbook.openai.com/examples/gpt-5/gpt-5_prompting_guide)** ‚Äî Latest prompting techniques

**Learning Resources:**
- **[Why LLM Evals Matter](https://www.youtube.com/watch?v=vygFgCNR7WA)** ‚Äî Video series on evaluation strategies
- **[Evaluation Challenges](https://youtu.be/vBJF2sy1Pyw)** ‚Äî Understanding evaluation pitfalls

</div>

---

<div style="padding:24px 28px; background:linear-gradient(135deg, #10b981 0%, #059669 100%); border-radius:12px; box-shadow:0 4px 20px rgba(16,185,129,0.4); margin-top:24px; color:#fff;">
  <div style="text-align:center; margin-bottom:20px;">
    <div style="font-size:3em; margin-bottom:8px;">üéì</div>
    <div style="font-size:1.4em; font-weight:700; margin-bottom:8px;">Course Complete!</div>
    <div style="font-size:1.05em; opacity:0.95; line-height:1.5;">You've mastered the Advanced Prompt Engineering for Developers course</div>
  </div>
  
  <div style="background:rgba(255,255,255,0.15); border-radius:8px; padding:20px; margin:20px 0; backdrop-filter:blur(10px);">
    <div style="font-size:0.95em; font-weight:600; margin-bottom:12px; text-transform:uppercase; letter-spacing:1px;">üèÜ What You've Accomplished</div>
    <div style="font-size:0.92em; line-height:1.7; opacity:0.95;">
      <strong>Module 1:</strong> Foundations & prompt anatomy<br>
      <strong>Module 2:</strong> Core tactics (roles, structure, reasoning, patterns)<br>
      <strong>Module 3:</strong> Production workflows (code review, test generation, evaluation templates)<br>
    </div>
  </div>
  
  <div style="background:rgba(255,255,255,0.15); border-radius:8px; padding:20px; margin:20px 0; backdrop-filter:blur(10px);">
    <div style="font-size:0.95em; font-weight:600; margin-bottom:12px; text-transform:uppercase; letter-spacing:1px;">üöÄ You're Now Ready To</div>
    <div style="font-size:0.92em; line-height:1.8; opacity:0.95;">
      ‚úì Design production-ready prompts that scale across your team<br>
      ‚úì Build automated workflows with multi-tactic prompt templates<br>
      ‚úì Implement quality gates using evaluation rubrics<br>
      ‚úì Integrate prompt patterns into your development environment<br>
      ‚úì Lead prompt engineering initiatives at your organization
    </div>
  </div>
  
  <div style="background:rgba(255,255,255,0.15); border-radius:8px; padding:20px; margin:20px 0; backdrop-filter:blur(10px);">
    <div style="font-size:0.95em; font-weight:600; margin-bottom:12px; text-transform:uppercase; letter-spacing:1px;">üí° Continue Your Journey</div>
    <div style="font-size:0.92em; line-height:1.8; opacity:0.95;">
      ‚Ä¢ Apply these patterns to your real projects<br>
      ‚Ä¢ Share templates with your team<br>
      ‚Ä¢ Iterate and refine based on production feedback<br>
      ‚Ä¢ Build systematic evaluation datasets<br>
      ‚Ä¢ Contribute to the prompt engineering community
    </div>
  </div>
  
  <div style="text-align:center; margin-top:24px; padding-top:20px; border-top:2px solid rgba(255,255,255,0.2);">
    <div style="font-size:1.1em; font-weight:600; margin-bottom:8px;">Thank you for completing this course!</div>
    <div style="font-size:0.9em; opacity:0.9;">Keep building, experimenting, and pushing the boundaries of what's possible with AI.</div>
  </div>
</div>