# Section 2.4: Automation and Evaluation

| **Aspect** | **Details** |
|-------------|-------------|
| **Goal** | Master prompt chaining, self-correction loops, and LLM-as-judge tactics. |
| **Time** | ~25 minutes |
| **Prerequisites** | Complete Sections 2.1–2.3 and be comfortable with reasoning patterns. |
| **Next Steps** | Continue to Section 2.5: Practice and Solutions |

---

## 🔧 Quick Setup Check

Because you completed Sections 2.1–2.3, your environment is already set up. The cell below only needs to import it.

In [ ]:
# Quick setup check - imports setup_utils
try:
    import importlib
    import setup_utils
    importlib.reload(setup_utils)
    from setup_utils import *
    print(f"✅ Setup loaded! Using {AVAILABLE_PROVIDERS} with {get_default_model()}")
    print("🚀 Ready to build automated workflows!")
except ImportError:
    print("❌ Setup not found!")
    print("💡 Please run 2.1-setup-and-foundations.ipynb first to set up your environment.")

---

### 🔗 Tactic 6: Prompt Chaining

**Break complex tasks into sequential workflows**

**Core Principle:** Complex tasks can cause AI to "drop the ball" if handled in a single prompt. Prompt chaining breaks tasks into smaller, manageable subtasks where each step gets focused attention.

**Understanding Prompt Chaining Through Analogy**

Think of the difference between asking one person to do everything versus using an assembly line process.

**❌ Single Prompt: The Overwhelmed Solo Worker**

Imagine giving someone this massive task all at once:  
```code
Build a complete car from scratch: design it, create the engine, assemble the body, paint it, install electronics, and test everything—all in one go.
```

**Problems with this approach:**
- Too much to track at once, so details get lost
- Quality suffers when juggling multiple complex tasks
- Early mistakes cascade through the entire process
- No chance to course-correct between steps
- Can't bring specialized expertise to each phase

**✅ Prompt Chaining: The Focused Assembly Line**

Now imagine breaking the same work into a sequence where each step has one clear job:

```code
Station 1: Design the car → passes blueprints to →
Station 2: Build the engine → passes engine to →
Station 3: Assemble the body → passes frame to →
Station 4: Paint and finish → passes to →
Station 5: Quality testing → delivers final product
```

**Benefits of this approach:**
- **Focus:** Each step tackles one thing with full attention
- **Quality:** Specialized expertise at each stage improves results
- **Control:** You can inspect and adjust between steps
- **Prevention:** Errors are caught early before they spread
- **Building blocks:** Each step creates verified work for the next

**The Key Insight:** Just like in manufacturing, breaking complex AI tasks into focused steps produces better results than trying to do everything at once.

**Single Prompt Architecture**

<div style="background-color: #f5f5f5; padding: 20px; border-radius: 8px; border-left: 4px solid #666; margin: 20px 0; font-family: 'Courier New', monospace; font-size: 14px;">
  <div style="text-align: center; margin-bottom: 15px;">
    <div style="background-color: #4A90E2; color: white; padding: 10px; border-radius: 5px; display: inline-block; font-size: 14px; font-weight: bold;">
      Input (Complex Request)
    </div>
  </div>
  <div style="text-align: center; font-size: 24px; margin: 10px 0; color: #333;">↓</div>
  <div style="text-align: center; margin: 15px 0;">
    <div style="background-color: #E8E8E8; padding: 15px; border-radius: 5px; display: inline-block; min-width: 200px; color: #333;">
      <strong style="font-size: 14px;">[AI Processing]</strong>
      <div style="margin-top: 10px; text-align: left; font-size: 14px; line-height: 1.6;">
        - Task A<br/>
        - Task B<br/>
        - Task C<br/>
        - Task D
      </div>
    </div>
  </div>
  <div style="text-align: center; font-size: 24px; margin: 10px 0; color: #333;">↓</div>
  <div style="text-align: center; margin-top: 15px;">
    <div style="background-color: #50C878; color: white; padding: 10px; border-radius: 5px; display: inline-block; font-size: 14px; font-weight: bold;">
      Output (All Results at Once)
    </div>
  </div>
</div>

**Characteristics:**

- Linear, one-shot processing
- No intermediate verification
- Context dilution across tasks
- Limited depth per subtask

**Prompt Chaining Architecture**

<div style="background-color: #f5f5f5; padding: 20px; border-radius: 8px; border-left: 4px solid #666; margin: 20px 0; font-family: 'Courier New', monospace; font-size: 14px;">
  <div style="margin: 15px 0;">
    <span style="background-color: #4A90E2; color: white; padding: 8px 12px; border-radius: 5px; font-size: 14px; font-weight: bold;">Input 1</span>
    <span style="margin: 0 10px; color: #333; font-size: 16px;">→</span>
    <span style="background-color: #E8E8E8; padding: 8px 12px; border-radius: 5px; color: #333; font-size: 14px; font-weight: bold;">[AI Task A]</span>
    <span style="margin: 0 10px; color: #333; font-size: 16px;">→</span>
    <span style="background-color: #50C878; color: white; padding: 8px 12px; border-radius: 5px; font-size: 14px; font-weight: bold;">Output 1</span>
  </div>
  <div style="text-align: center; font-size: 18px; margin: 10px 0; color: #333;">↓ <span style="font-size: 12px; font-style: italic;">(becomes input)</span></div>
  <div style="margin: 15px 0; padding-left: 40px;">
    <span style="background-color: #4A90E2; color: white; padding: 8px 12px; border-radius: 5px; font-size: 14px; font-weight: bold;">Input 2</span>
    <span style="margin: 0 10px; color: #333; font-size: 16px;">→</span>
    <span style="background-color: #E8E8E8; padding: 8px 12px; border-radius: 5px; color: #333; font-size: 14px; font-weight: bold;">[AI Task B]</span>
    <span style="margin: 0 10px; color: #333; font-size: 16px;">→</span>
    <span style="background-color: #50C878; color: white; padding: 8px 12px; border-radius: 5px; font-size: 14px; font-weight: bold;">Output 2</span>
  </div>
  <div style="text-align: center; font-size: 18px; margin: 10px 0; padding-left: 40px; color: #333;">↓ <span style="font-size: 12px; font-style: italic;">(becomes input)</span></div>
  <div style="margin: 15px 0; padding-left: 80px;">
    <span style="background-color: #4A90E2; color: white; padding: 8px 12px; border-radius: 5px; font-size: 14px; font-weight: bold;">Input 3</span>
    <span style="margin: 0 10px; color: #333; font-size: 16px;">→</span>
    <span style="background-color: #E8E8E8; padding: 8px 12px; border-radius: 5px; color: #333; font-size: 14px; font-weight: bold;">[AI Task C]</span>
    <span style="margin: 0 10px; color: #333; font-size: 16px;">→</span>
    <span style="background-color: #50C878; color: white; padding: 8px 12px; border-radius: 5px; font-size: 14px; font-weight: bold;">Output 3</span>
  </div>
  <div style="text-align: center; font-size: 18px; margin: 10px 0; padding-left: 80px; color: #333;">↓</div>
  <div style="text-align: center; margin-top: 15px;">
    <div style="background-color: #FFD700; color: #333; padding: 10px 20px; border-radius: 5px; display: inline-block; font-weight: bold; font-size: 14px;">
      Final Result
    </div>
  </div>
</div>

**Characteristics:**

* Sequential, focused processing
* Verification points between stages
* Context accumulation and refinement
* Deep specialization per task

**Practical Example: Code Review**

Let's see this in action with a real software engineering task.

**❌ Single Prompt Approach:**

```text
Review this authentication service code for security issues, performance problems, code quality, and generate fixes with tests.
```

**Problems:**
- AI must juggle security analysis + performance review + quality check + fix generation
- May miss subtle vulnerabilities while thinking about tests
- Fixes generated without thorough analysis
- No opportunity to validate findings before committing to solutions

**✅ Prompt Chain Approach:**

```text
Chain 1: "Analyze this authentication code for security vulnerabilities only"
         → Produces: List of security issues with severity ratings
         
Chain 2: "Review the same code for performance bottlenecks"
         → Produces: Performance analysis with metrics
         
Chain 3: "Evaluate code quality and maintainability"
         → Produces: Quality assessment
         
Chain 4: "Based on ALL the analyses above, generate prioritized fixes"
         → Produces: Comprehensive solutions addressing all issues
         
Chain 5: "Create tests for the fixed implementation"
         → Produces: Test suite validating the improvements
```

**Benefits:**
- Each analysis gets full attention without distraction
- You can verify findings at each stage
- Later steps have complete context from earlier work
- Higher quality, more thorough results

---

### When to Use Each Approach

**Use Single Prompt When:**
- ✅ Task is simple and self-contained
- ✅ You need a quick, one-off response
- ✅ The request has no interdependent parts
- ✅ Speed matters more than depth
- ✅ Example: "Format this JSON" or "Explain what this function does"

**Use Prompt Chaining When:**
- 🔗 Task is complex with multiple stages
- 🔗 Quality and accuracy are critical
- 🔗 You need to verify intermediate results
- 🔗 Later steps depend on earlier outputs
- 🔗 You want to iterate and refine
- 🔗 Working with large codebases or documentation
- 🔗 Examples: Multi-file refactoring, comprehensive security audits, architecture decisions

---

### Implementation Best Practices

**How to Chain Effectively:**
1. **Identify subtasks:** Break task into distinct, sequential steps
2. **Use XML tags:** Pass outputs between prompts with structured tags like `<analysis>`, `<review>`, `<code>`
3. **Single objective:** Each step has one clear goal
4. **Iterate:** Refine problematic steps without redoing entire chain
5. **Pass context forward:** Each step receives relevant outputs from previous steps
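
The steps above can be sketched as a small runner. This is a minimal sketch under stated assumptions, not the notebook's own implementation: `wrap`, `run_chain`, and `echo_llm` are hypothetical names introduced here, and `llm` stands for any callable that takes a prompt string and returns text (for example, a thin wrapper around `get_chat_completion`).

```python
from typing import Callable, Dict, List, Tuple


def wrap(tag: str, content: str) -> str:
    """Wrap content in an XML-style tag so a later prompt can refer to it by name."""
    return f"<{tag}>\n{content}\n</{tag}>"


def run_chain(
    task_input: str,
    steps: List[Tuple[str, str]],
    llm: Callable[[str], str],
) -> Dict[str, str]:
    """Run (tag, instruction) steps in order, passing every earlier output forward."""
    outputs: Dict[str, str] = {}
    for tag, instruction in steps:
        # Context = the original input plus each earlier step's tagged output.
        context = [wrap("input", task_input)]
        context += [wrap(name, text) for name, text in outputs.items()]
        prompt = "\n\n".join(context) + f"\n\n{instruction}"
        outputs[tag] = llm(prompt)  # one objective per model call
    return outputs


# Stand-in model (hypothetical) so the sketch runs without an API key.
def echo_llm(prompt: str) -> str:
    return f"[{len(prompt)} chars received]"


steps = [
    ("analysis", "List security issues only."),
    ("fixes", "Using <analysis>, propose prioritized fixes."),
]
for name, text in run_chain("def f(): pass", steps, echo_llm).items():
    print(name, "->", text)
```

Because each step's output is stored under its tag, you can inspect any intermediate result before the next step consumes it.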

**Common Patterns from Real Workflows:**
- **Content Creation:** Research → Outline → Draft → Edit → Format
- **Code Review:** Analyze → Rate Severity → Generate Fixes → Validate
- **Document Analysis:** Extract key points → Summarize → Generate insights
- **Self-Correction:** Generate → Review own work → Improve → Verify

---

### Advanced Pattern - Self-Correction Chains

**Want to take prompt chaining further?** Create chains where AI reviews and improves its own work!

**The Analogy:** Like a manuscript going through editorial revisions:
- **Regular Chain:** Draft → Edit → Publish (one pass)
- **Self-Correcting:** Draft → Critique → Revise → Verify → Publish (feedback loops)

**The 3-Step Pattern:**
```
1. Generate → AI creates initial solution
2. Critique → AI reviews its OWN work for issues
3. Improve → AI fixes identified problems
```

**Self-Correcting Architecture**

<div style="background-color: #f5f5f5; padding: 20px; border-radius: 8px; border-left: 4px solid #9370DB; margin: 20px 0; font-family: 'Courier New', monospace; font-size: 14px;">
  <div style="margin: 15px 0;">
    <span style="background-color: #4A90E2; color: white; padding: 8px 12px; border-radius: 5px; font-size: 14px; font-weight: bold;">Input</span>
    <span style="margin: 0 10px; color: #333; font-size: 16px;">→</span>
    <span style="background-color: #E8E8E8; padding: 8px 12px; border-radius: 5px; color: #333; font-size: 14px; font-weight: bold;">[Generate Draft]</span>
    <span style="margin: 0 10px; color: #333; font-size: 16px;">→</span>
    <span style="background-color: #FFB347; color: white; padding: 8px 12px; border-radius: 5px; font-size: 14px; font-weight: bold;">Draft v1</span>
  </div>
  <div style="text-align: center; font-size: 18px; margin: 10px 0; color: #333;">↓</div>
  <div style="margin: 15px 0; padding-left: 40px;">
    <span style="background-color: #9370DB; color: white; padding: 8px 12px; border-radius: 5px; font-size: 14px; font-weight: bold;">[Critique & Identify Issues]</span>
    <span style="margin: 0 10px; color: #333; font-size: 16px;">→</span>
    <span style="background-color: #DC143C; color: white; padding: 8px 12px; border-radius: 5px; font-size: 14px; font-weight: bold;">Issues List</span>
  </div>
  <div style="text-align: center; font-size: 18px; margin: 10px 0; padding-left: 40px; color: #333;">↓ <span style="font-size: 12px; font-style: italic;">(feedback loop)</span></div>
  <div style="margin: 15px 0; padding-left: 80px;">
    <span style="background-color: #E8E8E8; padding: 8px 12px; border-radius: 5px; color: #333; font-size: 14px; font-weight: bold;">[Revise Based on Critique]</span>
    <span style="margin: 0 10px; color: #333; font-size: 16px;">→</span>
    <span style="background-color: #32CD32; color: white; padding: 8px 12px; border-radius: 5px; font-size: 14px; font-weight: bold;">Improved v2</span>
  </div>
  <div style="text-align: center; font-size: 18px; margin: 10px 0; padding-left: 80px; color: #333;">↓</div>
  <div style="text-align: center; margin-top: 15px;">
    <div style="background-color: #FFD700; color: #333; padding: 10px 20px; border-radius: 5px; display: inline-block; font-weight: bold; font-size: 14px;">
      Final Verified Output
    </div>
  </div>
</div>

**When to Use:**
- ✅ Production code generation (high quality needed)
- ✅ Critical analysis tasks (accuracy matters)
- ✅ Automated QA workflows (no human review available)

**Trade-offs:**
- ⚠️ More API calls = Higher cost
- ⚠️ Longer execution time
- ✅ Best for quality over speed

<div style="margin:16px 0; padding:12px; background:#fef3c7; border-radius:6px; border-left:4px solid #f59e0b; color:#78350f;">
<strong>💡 Pro Tip:</strong> The example below extends the 3-step pattern with a fourth verification step that assigns a quality score. Run it to see how the AI revises its own code!
</div>

In [None]:
# Shared setup helpers (run Section 2.1 first to install dependencies)
from setup_utils import get_chat_completion


In [None]:
# Self-Correcting Chain Example: Email Validator Code

# Helper functions for parsing XML tags and formatting output
def extract_between_tags(text: str, tag: str) -> str:
    """Extract content between XML-style tags."""
    start_tag = f"<{tag}>"
    end_tag = f"</{tag}>"
    
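    start = text.find(start_tag)
    if start != -1:
        start += len(start_tag)
        # Look for the closing tag only after the opening tag
        end = text.find(end_tag, start)
        if end != -1:
            return text[start:end].strip()
    # Fall back to the full text when the tags are missing
    return text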

def print_section(title: str, content: str = "", color_code: str = "96"):
    """Print formatted section with color and borders."""
    print(f"\n\033[{color_code}m{'=' * 70}")
    print(f"{title}")
    print(f"{'=' * 70}\033[0m")
    if content:
        print(content)

def self_correcting_chain(requirement: str) -> dict:
    """
    Execute a self-correcting chain with automatic quality improvement.
    
    Args:
        requirement: The task description
        
    Returns:
        Dictionary containing all stages of the process
    """
    results = {}
    
    # STEP 1: Generate Initial Solution
    print_section("STEP 1: Generate Initial Solution", "", "94")
    
    initial_messages = [{
        "role": "user",
        "content": f"""Task: {requirement}

Requirements:
- Provide complete, working code
- Include docstrings and comments
- Handle edge cases
- Use best practices

Wrap your code in <code></code> tags."""
    }]
    
    initial_response = get_chat_completion(initial_messages)
    initial_code = extract_between_tags(initial_response, "code")
    results['initial'] = initial_code
    
    print(initial_response)
    
    # STEP 2: Self-Review with Detailed Criteria
    print_section("STEP 2: AI Self-Review & Critique", "", "93")
    
    critique_messages = [{
        "role": "user",
        "content": f"""You previously wrote this code:

<code>
{initial_code}
</code>

Now, critically review YOUR OWN code using these criteria:

**Security:**
- Input validation vulnerabilities
- Injection risks
- Data exposure issues

**Correctness:**
- Edge cases (empty strings, None, special characters)
- Boundary conditions
- Logic errors

**Code Quality:**
- Error handling completeness
- Code clarity and maintainability
- Performance considerations
- Following Python best practices (PEP 8)

Provide your analysis in this format:
<issues>
[List specific issues found, or write "No critical issues found"]
</issues>

<severity>
[Rate overall: CRITICAL / MODERATE / MINOR / NONE]
</severity>

<recommendations>
[Specific improvements to make]
</recommendations>"""
    }]
    
    critique_response = get_chat_completion(critique_messages)
    issues = extract_between_tags(critique_response, "issues")
    severity = extract_between_tags(critique_response, "severity")
    recommendations = extract_between_tags(critique_response, "recommendations")
    
    results['critique'] = {
        'full_response': critique_response,
        'issues': issues,
        'severity': severity,
        'recommendations': recommendations
    }
    
    print(critique_response)
    
    # STEP 3: Self-Improvement Based on Critique
    print_section("STEP 3: AI Self-Improvement", "", "92")
    
    improve_messages = [{
        "role": "user",
        "content": f"""Your original code:
<original_code>
{initial_code}
</original_code>

Your self-review identified these issues:
<issues>
{issues}
</issues>

Severity: {severity}

Recommendations:
<recommendations>
{recommendations}
</recommendations>

Now improve your code by:
1. Fixing all identified issues
2. Implementing the recommendations
3. Adding comprehensive error handling
4. Including detailed docstrings

Provide the improved code in <improved_code></improved_code> tags.
Also explain what you changed in <changes></changes> tags."""
    }]
    
    improved_response = get_chat_completion(improve_messages)
    improved_code = extract_between_tags(improved_response, "improved_code")
    changes = extract_between_tags(improved_response, "changes")
    
    results['improved'] = {
        'code': improved_code,
        'changes': changes,
        'full_response': improved_response
    }
    
    print(improved_response)
    
    # STEP 4: Final Verification
    print_section("STEP 4: Final Quality Verification", "", "95")
    
    verify_messages = [{
        "role": "user",
        "content": f"""Review this final code and confirm all issues are resolved:

<code>
{improved_code}
</code>

Provide:
<verification_status>PASS or FAIL</verification_status>
<remaining_issues>List any remaining issues, or "None"</remaining_issues>
<quality_score>Rate 1-10</quality_score>"""
    }]
    
    verification_response = get_chat_completion(verify_messages)
    status = extract_between_tags(verification_response, "verification_status")
    remaining = extract_between_tags(verification_response, "remaining_issues")
    score = extract_between_tags(verification_response, "quality_score")
    
    results['verification'] = {
        'status': status,
        'remaining_issues': remaining,
        'quality_score': score,
        'full_response': verification_response
    }
    
    print(verification_response)
    
    return results

# Run the self-correcting chain
requirement = "Function to validate email addresses with comprehensive checks"

print("\033[96m" + "=" * 70)
print("🔄 SELF-CORRECTING CHAIN DEMONSTRATION")
print("=" * 70 + "\033[0m")
print(f"Requirement: {requirement}\n")

results = self_correcting_chain(requirement)

# Print summary
print_section("📊 EXECUTION SUMMARY", "", "96")

print(f"""
Initial Code Length:     {len(results['initial'])} characters
Issues Identified:       {results['critique']['severity']}
Improved Code Length:    {len(results['improved']['code'])} characters
Verification Status:     {results['verification']['status']}
Quality Score:           {results['verification']['quality_score']}/10

Key Changes Made:
{results['improved']['changes']}

Remaining Issues:
{results['verification']['remaining_issues']}
""")

#### Key Takeaways: Prompt Chaining

**The Core Idea:**
- **Single Prompt** = One person doing everything (overwhelmed)
- **Prompt Chaining** = Assembly line where each station specializes (focused)
- **Self-Correcting** = Assembly line + quality inspectors at each stage (verified)

**When to Chain:**
- ✅ Complex tasks with multiple stages
- ✅ Quality matters more than speed
- ✅ Need to verify intermediate results
- ✅ Later steps depend on earlier outputs

**How to Chain:**
1. Break into sequential steps (Analyze → Fix → Test)
2. Pass outputs forward using XML tags (`<analysis>`, `<code>`)
3. One clear goal per step
4. Use specialized roles per step (security engineer → QA)

**Common Patterns:**
- **Code Review:** Analyze → Rate severity → Generate fixes → Validate
- **Self-Correction:** Generate → Critique own work → Improve → Verify
- **Security Audit:** Scan → Prioritize → Fix → Verify

<div style="margin:16px 0; padding:12px; background:#fef3c7; border-radius:6px; border-left:4px solid #f59e0b; color:#78350f;">
<strong>💡 Pro Tip:</strong> If a step performs poorly, isolate and fix just that step without redoing the entire chain.
</div>
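
Re-running a single step is easier when each step's output is saved. The sketch below is one possible approach, not part of the notebook's helpers: `run_from`, `cache`, and `stub_llm` are hypothetical names, and `llm` is any callable that maps a prompt string to a text reply.

```python
from typing import Callable, Dict, List, Tuple


def run_from(
    steps: List[Tuple[str, str]],
    cache: Dict[str, str],
    start: str,
    llm: Callable[[str], str],
) -> Dict[str, str]:
    """Re-run the chain from step `start`, reusing cached outputs of earlier steps."""
    names = [name for name, _ in steps]
    first = names.index(start)  # raises ValueError for an unknown step name
    # Earlier outputs must already be cached (KeyError otherwise).
    outputs = {name: cache[name] for name in names[:first]}
    for name, instruction in steps[first:]:
        context = "\n\n".join(f"<{k}>\n{v}\n</{k}>" for k, v in outputs.items())
        outputs[name] = llm(f"{context}\n\n{instruction}".strip())
    return outputs


steps = [
    ("analysis", "List the issues in the code."),
    ("fix", "Fix the issues listed in <analysis>."),
    ("tests", "Write tests for the code in <fix>."),
]
cache = {"analysis": "1. amount is never validated"}  # saved from an earlier run
calls = []


def stub_llm(prompt: str) -> str:
    """Stand-in model that records each prompt it receives."""
    calls.append(prompt)
    return f"output {len(calls)}"


result = run_from(steps, cache, "fix", stub_llm)
print(len(calls), sorted(result))
```

Only the steps from `fix` onward call the model. The cached analysis is reused as context, so a revised instruction for one step does not require redoing the steps before it.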

*Reference: [Claude Documentation - Chain Complex Prompts](https://docs.claude.com/en/docs/build-with-claude/prompt-engineering/chain-prompts)*

---

### 🎯 Try It Yourself: Prompt Chaining

**Common Misconception:** AI can handle multiple complex tasks in one prompt just as well as breaking them into steps.

**The Reality:** Chaining gives each task full attention, which typically improves depth and makes errors easier to catch between steps.

**Your Task:** Below is code with security and validation issues. The BAD prompt asks AI to do everything at once (review + fix + tests). 

Your task is to:

1. First, run the cell to see the BAD single-prompt approach
2. Then, uncomment the GOOD 3-step chain and run again
3. Compare the depth and quality of both approaches

The GOOD approach breaks it into:
- **Step 1:** Analyze issues only
- **Step 2:** Fix code based on analysis
- **Step 3:** Generate tests for fixed code

See how focused steps produce better results!

In [None]:
# Code with multiple issues
problematic_code = """
def process_payment(amount, card_number):
    if amount > 0:
        charge_card(card_number, amount)
        return "success"
"""

# ❌ BAD: Everything at once
print("=" * 70)
print("BAD: Single Prompt (Everything at Once)")
print("=" * 70)

bad_messages = [{
    "role": "user",
    "content": f"Review this code, fix all issues, and write tests:\n\n{problematic_code}"
}]

bad_response = get_chat_completion(bad_messages)
print(bad_response[:400] + "...\n")  # Show truncated output

# ✅ YOUR TURN: Uncomment to see the focused 3-step approach
#
# print("=" * 70)
# print("GOOD: Prompt Chain (3 Focused Steps)")
# print("=" * 70)
#
# # STEP 1: Analyze
# step1 = get_chat_completion([{
#     "role": "user",
#     "content": f"Analyze this code for issues:\n{problematic_code}\n\nList: security, validation, error handling."
# }])
# print("\n🔍 STEP 1 - Analysis:")
# print(step1)
#
# # STEP 2: Fix based on analysis
# step2 = get_chat_completion([{
#     "role": "user",
#     "content": f"Fix this code based on analysis:\n\nCode: {problematic_code}\n\nIssues: {step1}"
# }])
# print("\n🔧 STEP 2 - Fixed Code:")
# print(step2)
#
# # STEP 3: Generate tests
# step3 = get_chat_completion([{
#     "role": "user",
#     "content": f"Write tests for:\n{step2}\n\nInclude: happy path, edge cases, error handling."
# }])
# print("\n✅ STEP 3 - Tests:")
# print(step3)
#
# print("\n" + "=" * 70)
# print("💡 COMPARISON:")
# print("  Single: Tries everything → shallow analysis")
# print("  Chain: One focus per step → thorough results")
# print("=" * 70)

---

<div style="margin:20px 0; padding:16px 24px; background:linear-gradient(135deg, #f093fb 0%, #f5576c 100%); border-radius:10px; color:#fff; text-align:center; box-shadow:0 4px 15px rgba(240,147,251,0.3);">
  <strong style="font-size:1.05em;">🌟 Excellent progress! Deep learning requires breathing room.</strong><br>
  <span style="font-size:0.92em; opacity:0.95; margin-top:4px; display:block;">Step away for a few minutes—your brain will thank you when you tackle the final tactics.</span>
</div>

---


### ⚖️ Tactic 7: LLM-as-Judge

**Create evaluation rubrics and self-critique loops**

**Core Principle:** One of the most powerful patterns in prompt engineering is using an AI model as a judge or critic to evaluate and improve outputs. This creates a self-improvement loop where the AI reviews, critiques, and refines work—either its own outputs or those from other sources.

**Why Use LLM-as-Judge:**
- **Quality assurance:** Catch errors, inconsistencies, and areas for improvement
- **Criteria-based evaluation:** Get assessments anchored to specific, explicit criteria rather than general impressions
- **Iterative refinement:** Continuously improve outputs through multiple review cycles
- **Scalable review:** Automate code reviews, documentation checks, and quality audits

**When to Use LLM-as-Judge:**
- Code review and quality assessment
- Evaluating multiple solution approaches
- Grading or scoring responses against rubrics
- Providing constructive feedback on technical writing
- Testing and validation of AI-generated content
- Comparing different implementations

**How to Implement:**
1. **Define clear criteria:** Specify what makes a good/bad output
2. **Provide rubrics:** Give the judge specific evaluation dimensions
3. **Request structured feedback:** Ask for scores, ratings, or categorized feedback
4. **Include examples:** Show what excellent vs. poor outputs look like
5. **Iterate:** Use feedback to improve and re-evaluate
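
As a rough illustration of steps 1 to 3, the sketch below builds a judge prompt from a weighted rubric and combines per-criterion scores into a single number. The function names, the `<score>` reply format, and the example scores are assumptions made for illustration. Only the weights (Security 40%, Performance 30%, Readability 30%) mirror the exercise later in this section.

```python
from typing import Dict


def build_judge_prompt(candidate: str, rubric: Dict[str, float]) -> str:
    """Build a judge prompt that lists each criterion with its weight."""
    if abs(sum(rubric.values()) - 1.0) > 1e-9:
        raise ValueError("rubric weights must sum to 1.0")
    lines = [f"- {name} (weight: {weight:.0%})" for name, weight in rubric.items()]
    return (
        "Evaluate the candidate against this rubric. "
        "Score each criterion from 1 to 10.\n"
        + "\n".join(lines)
        + f"\n\n<candidate>\n{candidate}\n</candidate>\n\n"
        + 'Reply with one <score criterion="NAME">N</score> line per criterion.'
    )


def weighted_total(scores: Dict[str, float], rubric: Dict[str, float]) -> float:
    """Combine per-criterion scores into one number using the rubric weights."""
    missing = set(rubric) - set(scores)
    if missing:
        raise ValueError(f"missing scores for: {sorted(missing)}")
    return sum(scores[name] * weight for name, weight in rubric.items())


rubric = {"Security": 0.4, "Performance": 0.3, "Readability": 0.3}
print(build_judge_prompt("def hash_pwd(p): ...", rubric))

# Hypothetical scores, as if parsed from a judge's reply.
example_scores = {"Security": 9, "Performance": 6, "Readability": 8}
print(round(weighted_total(example_scores, rubric), 2))
```

Computing the weighted total in code, rather than asking the model for it, keeps the arithmetic deterministic and lets you change weights without another model call.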

Automated code reviews, architecture decision validation, test coverage assessment, documentation quality checks, and comparing multiple implementation approaches all benefit from LLM-as-Judge.

*Reference: This technique combines elements from evaluation frameworks and self-critique patterns used in production AI systems.*


#### Example 1: Code Quality Judge

Let's use AI as a judge to evaluate and compare two different implementations:


In [None]:
# Code quality judge: AI scores two implementations against a weighted rubric

# Two illustrative implementations of the same task
impl_a = '''
def find_duplicates(items):
    duplicates = []
    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            if items[i] == items[j] and items[i] not in duplicates:
                duplicates.append(items[i])
    return duplicates
'''

impl_b = '''
from collections import Counter

def find_duplicates(items):
    counts = Counter(items)
    return [item for item, count in counts.items() if count > 1]
'''

judge_messages = [
    {
        "role": "system",
        "content": "You are a senior Python engineer acting as an impartial code judge. Score strictly against the rubric you are given."
    },
    {
        "role": "user",
        "content": f"""Compare these two implementations of the same function.

<implementation_a>
{impl_a}
</implementation_a>

<implementation_b>
{impl_b}
</implementation_b>

Score each implementation from 1-10 on every criterion:
1. Correctness and edge cases (weight: 40%)
2. Performance (weight: 30%)
3. Readability and maintainability (weight: 30%)

Structure your response:
<scores>Per-criterion scores and the weighted total for A and B</scores>
<pros_cons>Pros and cons of each implementation</pros_cons>
<recommendation>The better option, with justification</recommendation>"""
    }
]

print("=" * 70)
print("CODE QUALITY JUDGE: Weighted Rubric Comparison")
print("=" * 70)

verdict = get_chat_completion(judge_messages)
print(verdict)

#### Example 2: Self-Critique and Improvement Loop

Now let's create an improvement loop where AI generates code, critiques it, and then improves it:


In [None]:
# Example: Self-Critique and Improvement Loop
requirement = "Create a function that validates and sanitizes user input for a SQL query"

# STEP 1: Generate initial solution
print("=" * 70)
print("STEP 1: Generate Initial Solution")
print("=" * 70)

generate_messages = [
    {
        "role": "system",
        "content": "You are a Python developer. Generate code solutions."
    },
    {
        "role": "user",
        "content": f"""{requirement}

Provide your implementation in <code> tags."""
    }
]

initial_code = get_chat_completion(generate_messages)
print(initial_code)
print("\n")

# STEP 2: Critique the solution
print("=" * 70)
print("STEP 2: Critique the Solution")
print("=" * 70)

critique_messages = [
    {
        "role": "system",
        "content": """You are a security-focused code reviewer. 
        
Evaluate code for:
- Security vulnerabilities
- Best practices
- Error handling
- Edge cases
- Code quality

Provide brutally honest feedback with specific issues and severity levels."""
    },
    {
        "role": "user",
        "content": f"""Requirement: {requirement}

Initial implementation:
{initial_code}

Critique this implementation. Identify all issues, rate severity (Critical/High/Medium/Low), and suggest specific improvements.

Structure your response:
<critique>Your detailed critique</critique>
<issues>List of specific issues with severity</issues>
<suggestions>Actionable improvement suggestions</suggestions>"""
    }
]

critique = get_chat_completion(critique_messages)
print(critique)
print("\n")

# STEP 3: Improve based on critique
print("=" * 70)
print("STEP 3: Improved Implementation")
print("=" * 70)

improve_messages = [
    {
        "role": "system",
        "content": "You are a senior Python developer who learns from feedback."
    },
    {
        "role": "user",
        "content": f"""Requirement: {requirement}

Original implementation:
{initial_code}

Critique received:
{critique}

Create an improved implementation that addresses ALL the issues raised in the critique.
Provide the improved code in <improved_code> tags and explain key changes in <changes> tags."""
    }
]

improved_code = get_chat_completion(improve_messages)
print(improved_code)


#### Key Takeaways: LLM-as-Judge

**What We Demonstrated:**

**Example 1: Code Quality Judge**
- Defined clear evaluation criteria with weights
- Provided structured rubrics for assessment
- Got objective comparison of two implementations
- Received scored evaluation with pros/cons and recommendation

**Example 2: Self-Critique and Improvement Loop**
- **Step 1:** Generated initial code solution
- **Step 2:** Used AI as brutal critic to identify issues
- **Step 3:** Improved code based on critique feedback
- Created a self-improvement cycle

**Benefits of LLM-as-Judge:**

1. **Criteria-Based Evaluation:**
   - Assessment anchored to defined criteria
   - More consistent scoring across multiple evaluations
   - Can reduce reviewer-to-reviewer variation in code reviews

2. **Continuous Improvement:**
   - Iterative refinement through critique loops
   - Learn from mistakes and feedback
   - Progressive quality enhancement

3. **Scalable Reviews:**
   - Automate repetitive evaluation tasks
   - Handle multiple implementations simultaneously
   - Save senior engineers' time for complex decisions

4. **Structured Feedback:**
   - Clear, actionable improvement suggestions
   - Severity ratings for prioritization
   - Specific examples and recommendations

**Real-World Applications:**

- **Automated Code Reviews:** Evaluate PRs against coding standards before human review
- **Architecture Decisions:** Compare multiple design approaches objectively
- **Test Quality Assessment:** Evaluate test coverage and edge case handling
- **Documentation Quality:** Grade documentation completeness and clarity
- **API Design Review:** Compare REST vs GraphQL implementations
- **Performance Optimization:** Evaluate before/after optimization attempts
- **Security Audits:** Systematic vulnerability assessment with severity ratings

**Implementation Patterns:**

```python
# Pattern 1: Single evaluation
judge_prompt = """
Evaluate [OUTPUT] based on:
1. Criterion A (weight: X%)
2. Criterion B (weight: Y%)

Provide scores and recommendation.
"""

# Pattern 2: Comparative evaluation
judge_prompt = """
Compare [OPTION_A] and [OPTION_B] against:
- Criteria 1
- Criteria 2
- Criteria 3

Recommend the better option with justification.
"""

# Pattern 3: Self-improvement loop
1. Generate solution
2. Critique solution (AI as judge)
3. Improve based on critique
4. (Optional) Re-evaluate improvement
```

**Pro Tips:**
- **Define clear rubrics:** Specific criteria produce better judgments
- **Use weighted scoring:** Prioritize what matters most
- **Request examples:** Ask for specific code snippets in feedback
- **Iterate multiple times:** Don't stop at first critique
- **Combine with other tactics:** Use with prompt chaining for multi-stage reviews


---

### 🎯 Try It Yourself: LLM-as-Judge

**Common Misconception:** AI comparisons are just subjective opinions without clear reasoning.

**The Reality:** Weighted evaluation rubrics produce objective, actionable assessments.

**Your Task:** You have two implementations below. The bad prompt just asks "which is better?" Fix it by:
1. Creating specific evaluation criteria
2. Adding weights to each criterion (e.g., Security 40%, Performance 30%, Readability 30%)
3. Requesting scores and structured comparison

See how rubrics transform vague opinions into actionable insights!

In [None]:
# Two implementations to compare
impl_a = "def hash_pwd(p): return hashlib.md5(p.encode()).hexdigest()"
impl_b = "def hash_pwd(p): return bcrypt.hashpw(p.encode(), bcrypt.gensalt())"

# ❌ BAD: Vague comparison request
bad_messages = [{"role": "user", "content": f"Which is better?\nA: {impl_a}\nB: {impl_b}"}]
bad_response = get_chat_completion(bad_messages)
print("=" * 70)
print("WITHOUT RUBRIC (Vague opinion):")
print("=" * 70)
print(bad_response)
print("\n")

# ✅ YOUR TURN: Create evaluation rubric
# TODO: Uncomment and complete
# good_messages = [{
#     "role": "system",
#     "content": """You are a code quality judge. Evaluate based on:
# - Security (40%): Resistance to attacks, proper crypto
# - Performance (30%): Speed, resource usage
# - Readability (30%): Clear, maintainable
# 
# Provide scores 0-10 for each, calculate weighted total, recommend best option."""
#     },
#     {
#         "role": "user",
#         "content": f"""Compare these password hashing implementations:
# 
# <implementation_a>
# {impl_a}
# </implementation_a>
# 
# <implementation_b>
# {impl_b}
# </implementation_b>
# 
# Provide:
# - Scores for each criterion
# - Weighted total scores
# - Recommendation with justification"""
#     }
# ]
# good_response = get_chat_completion(good_messages)
# print("=" * 70)
# print("WITH RUBRIC (Objective assessment):")
# print("=" * 70)
# print(good_response)
# print("\n💡 Rubric provides clear, actionable comparison with reasoning!")

---

### 🤫 Tactic 8: Inner Monologue

**Separate reasoning from clean final outputs**

**Core Principle:** The Inner Monologue technique guides AI models to articulate their thought process internally before delivering a final response, effectively "hiding" the reasoning steps from the end user. This is particularly useful when you want the benefits of chain-of-thought reasoning without exposing the intermediate thinking to users.

**Why Use Inner Monologue:**
- **Cleaner output:** Users see only the final answer, not the reasoning steps
- **Better reasoning:** The AI still benefits from step-by-step thinking internally
- **Professional presentation:** Provides concise, polished responses without verbose explanations
- **Flexible control:** You decide what to show and what to keep internal

**When to Use Inner Monologue:**
- Customer-facing applications where clean responses are important
- API responses that need to be concise
- Documentation generation where only conclusions matter
- Code generation where you want the code, not the thought process
- Production systems where token efficiency is critical

**How to Implement:**
1. **Instruct internal thinking:** Tell the AI to think through the problem internally
2. **Separate reasoning from output:** Use tags like `<thinking>` for internal reasoning and `<output>` for final results
3. **Extract final result:** Parse only the `<output>` section for user-facing display
4. **Optional logging:** Store the `<thinking>` section for debugging or quality assurance

Code generation tools, automated PR reviews, documentation generators, and customer-facing chatbots all benefit from intelligent responses without exposing the AI's reasoning process.


#### Example 1: Code Generation with Hidden Reasoning

Let's compare code generation with and without inner monologue:


In [None]:
# Example 1: WITHOUT Inner Monologue (verbose response)
print("=" * 70)
print("WITHOUT INNER MONOLOGUE (Verbose)")
print("=" * 70)

without_monologue = [
    {
        "role": "system",
        "content": "You are a Python developer helping with code generation."
    },
    {
        "role": "user",
        "content": """Create a function that validates email addresses using regex. 
It should check for proper format and common email providers."""
    }
]

response_verbose = get_chat_completion(without_monologue)
print(response_verbose)
print("\n")

# Example 1: WITH Inner Monologue (clean output)
print("=" * 70)
print("WITH INNER MONOLOGUE (Clean Output Only)")
print("=" * 70)

with_monologue = [
    {
        "role": "system",
        "content": """You are a Python developer. When solving problems:
1. Think through the requirements internally in <thinking> tags
2. Provide only the final code in <output> tags
3. Keep the output clean and production-ready"""
    },
    {
        "role": "user",
        "content": """Create a function that validates email addresses using regex. 
It should check for proper format and common email providers.

Think through the requirements internally, then provide only the final code."""
    }
]

response_clean = get_chat_completion(with_monologue)
print(response_clean)

# Extract only the output section (simulating production use)
import re
output_match = re.search(r'<output>(.*?)</output>', response_clean, re.DOTALL)
if output_match:
    print("\n" + "=" * 70)
    print("EXTRACTED FOR USER (Production Output)")
    print("=" * 70)
    print(output_match.group(1).strip())


#### Key Takeaways: Inner Monologue

**What We Demonstrated:**

**Example 1: Code Generation**
- **Without inner monologue:** AI provides verbose explanations mixed with code
- **With inner monologue:** AI thinks internally in `<thinking>` tags, outputs clean code in `<output>` tags
- **Production use:** Extract only the `<output>` section for user-facing applications

**Example 2: Bug Analysis**
- AI analyzes the bug internally (division by zero for empty list)
- Provides concise, actionable fix without lengthy explanation
- Perfect for automated bug-fixing tools or PR comments

**Benefits of Inner Monologue:**

1. **Best of Both Worlds:**
   - AI still benefits from step-by-step reasoning
   - Users get clean, concise results

2. **Production Ready:**
   - Responses are polished and professional
   - No verbose explanations cluttering the output
   - Token-efficient for cost-sensitive applications

3. **Flexible Control:**
   - Keep `<thinking>` for debugging and logging
   - Show `<output>` to end users
   - Audit AI reasoning when needed

4. **User Experience:**
   - Faster to read and understand
   - More professional appearance
   - Reduces cognitive load on users

**Real-World Applications:**

- **Code Generation Tools:** IDE extensions that generate clean code without explanations
- **Automated PR Reviews:** Concise comments on pull requests with reasoning logged separately
- **Documentation Generators:** Clean docs without showing the analysis process
- **Customer Support Bots:** Helpful answers without exposing decision trees
- **API Code Examples:** Clean, copy-paste ready code snippets
- **Debugging Assistants:** Direct fixes without lengthy troubleshooting narratives

**Implementation Pattern:**

```python
system_prompt = """
Process:
1. In <thinking> tags: Analyze, plan, consider edge cases
2. In <output> tags: Provide only the final result

Never show <thinking> to users - it's for your internal process only.
"""

# Then extract: 
output = extract_tag(response, 'output')  # Show to user
thinking = extract_tag(response, 'thinking')  # Log for debugging
```

**Pro Tip:** You can combine inner monologue with other tactics! Use it with prompt chaining for multi-step workflows where each step produces clean output, or with role prompting for specialized expert responses without verbose explanations.


---

### 🎯 Try It Yourself: Inner Monologue

**Common Misconception:** Users need to see all of AI's reasoning and thought process.

**The Reality:** Clean outputs with hidden reasoning provide better UX while maintaining quality.

**Your Task:** Below, AI generates code with verbose explanations mixed in. Fix it by:
1. Instructing AI to use `<thinking>` tags for internal reasoning
2. Using `<output>` tags for the final clean code
3. Extracting only the `<output>` section for the user

Perfect for production tools where users want code, not essays!

In [None]:
# ❌ BAD: Verbose output with explanations mixed in
bad_messages = [{
    "role": "user",
    "content": "Write a function to validate email addresses with regex."
}]
bad_response = get_chat_completion(bad_messages)
print("=" * 70)
print("VERBOSE OUTPUT (Explanations mixed with code):")
print("=" * 70)
print(bad_response)
print("\n")

# ✅ YOUR TURN: Use inner monologue to separate thinking from output
# TODO: Uncomment and complete
# good_messages = [{
#     "role": "system",
#     "content": """You are a code generator. Process:
# 1. In <thinking> tags: Plan the solution, consider edge cases
# 2. In <output> tags: Provide only the final, clean code

# Users see only <output>. Keep thinking internal."""
#     },
#     {
#         "role": "user",
#         "content": "Write a function to validate email addresses with regex."
#     }
# ]
# good_response = get_chat_completion(good_messages)

# # Extract clean output for user
# import re
# output_match = re.search(r'<output>(.*?)</output>', good_response, re.DOTALL)
# if output_match:
#     clean_output = output_match.group(1).strip()
#     print("=" * 70)
#     print("CLEAN OUTPUT (What user sees):")
#     print("=" * 70)
#     print(clean_output)
    
#     # Thinking is logged but not shown to user
#     thinking_match = re.search(r'<thinking>(.*?)</thinking>', good_response, re.DOTALL)
#     if thinking_match:
#         print("\n[Logged internally for debugging]:")
#         print(thinking_match.group(1).strip()[:200] + "...")

# print("\n💡 Users get clean code, you keep the reasoning for debugging!")

---

<div style="margin:20px 0; padding:16px 24px; background:linear-gradient(135deg, #ffecd2 0%, #fcb69f 100%); border-radius:10px; color:#8b4513; text-align:center; box-shadow:0 4px 15px rgba(252,182,159,0.3);">
  <strong style="font-size:1.05em;">🎉 All 8 tactics learned! Practice makes perfect.</strong><br>
  <span style="font-size:0.92em; opacity:0.95; margin-top:4px; display:block;">You've absorbed a lot—take a moment before diving into hands-on activities.</span>
</div>

---

<div style="margin:24px 0; padding:20px 24px; background:linear-gradient(135deg, #f8fafc 0%, #e2e8f0 100%); border-radius:12px; border-left:5px solid #10b981; box-shadow:0 2px 8px rgba(0,0,0,0.1);">
  <div style="color:#1e293b; font-size:0.85em; font-weight:600; text-transform:uppercase; letter-spacing:1px; margin-bottom:8px;">⏭️ Next Section</div>
  <div style="color:#0f172a; font-size:1.15em; font-weight:700; margin-bottom:6px;">Section 2.5: Hands-On Practice</div>
  <div style="color:#475569; font-size:0.95em; line-height:1.5; margin-bottom:12px;">Apply all 8 tactics independently in unguided practice activities with automated evaluation.</div>
  <a href="./2.5-hands-on-practice.ipynb" style="display:inline-block; padding:8px 16px; background:#10b981; color:#fff; text-decoration:none; border-radius:6px; font-weight:600; font-size:0.9em; transition:all 0.2s;">Continue to Section 2.5 →</a>
</div>