# Section 2.4: Advanced Workflows

| **Aspect** | **Details** |
|-------------|-------------|
| **Goal** | Master prompt chaining, tree of thoughts exploration, and LLM-as-judge evaluation tactics. |
| **Time** | ~25 minutes |
| **Prerequisites** | Complete Sections 2.1–2.3 and be comfortable with reasoning patterns. |
| **Next Steps** | Continue to Section 2.5: Practice and Solutions |

---
## 🔧 Quick Setup Check
Since you completed Sections 2.1–2.3, setup is already done. This cell only imports the shared helpers.

In [None]:
# Quick setup check - imports setup_utils
try:
    import importlib
    import setup_utils
    importlib.reload(setup_utils)
    from setup_utils import *
    print(f"✅ Setup loaded! Using {AVAILABLE_PROVIDERS} with {get_default_model()}")
    print("🚀 Ready to build automated workflows!")
except ImportError:
    print("❌ Setup not found!")
    print("💡 Please run 2.1-setup-and-foundations.ipynb first to set up your environment.")

---

### 🔗 Tactic 6: Prompt Chaining

**Break complex tasks into sequential workflows**

**Core Principle:** Complex tasks can cause AI to "drop the ball" if handled in a single prompt. Prompt chaining breaks tasks into smaller, manageable subtasks where each step gets focused attention.

#### 📚 Quick Overview: Prompt Chaining Approaches

Before diving into the details, here's a comparison of three approaches to handling complex tasks:

| Approach | When to Use | Key Benefit | Result |
|----------|-------------|-------------|--------|
| 🔴 **Single Prompt** | Simple, fast tasks | Speed and simplicity | One-shot output |
| 🔗 **Prompt Chain** | Complex, multi-step workflows | Quality through focused steps | Verified, high-quality output |
| 🔄 **Self-Correcting Chain** | Production code, critical tasks | Automated quality improvement | Auto-improved, validated output |

---

**Understanding Prompt Chaining Through Analogy**

Think of the difference between asking one person to do everything versus using an assembly line process.

**❌ Single Prompt: The Overwhelmed Solo Worker**

Imagine giving someone this massive task all at once:  
```code
Build a complete car from scratch: design it, create the engine, assemble the body, paint it, install electronics, and test everything—all in one go.
```

**Problems with this approach:**
- Too much to track at once, so details get lost
- Quality suffers when juggling multiple complex tasks
- Early mistakes cascade through the entire process
- No chance to course-correct between steps
- Can't bring specialized expertise to each phase

**✅ Prompt Chaining: The Focused Assembly Line**

Now imagine breaking the same work into a sequence where each step has one clear job:

```code
Station 1: Design the car → passes blueprints to →
Station 2: Build the engine → passes engine to →
Station 3: Assemble the body → passes frame to →
Station 4: Paint and finish → passes to →
Station 5: Quality testing → delivers final product
```

**Benefits of this approach:**
- **Focus:** Each step tackles one thing with full attention
- **Quality:** Specialized expertise at each stage improves results
- **Control:** You can inspect and adjust between steps
- **Prevention:** Errors are caught early before they spread
- **Building blocks:** Each step creates verified work for the next

**The Key Insight:** Just like in manufacturing, breaking complex AI tasks into focused steps produces better results than trying to do everything at once.

**Single Prompt Architecture**

<div style="background-color: #f5f5f5; padding: 20px; border-radius: 8px; border-left: 4px solid #666; margin: 20px 0; font-family: 'Courier New', monospace; font-size: 14px;">
  <div style="text-align: center; margin-bottom: 15px;">
    <div style="background-color: #4A90E2; color: white; padding: 10px; border-radius: 5px; display: inline-block; font-size: 14px; font-weight: bold;">
      Input (Complex Request)
    </div>
  </div>
  <div style="text-align: center; font-size: 24px; margin: 10px 0; color: #333;">↓</div>
  <div style="text-align: center; margin: 15px 0;">
    <div style="background-color: #E8E8E8; padding: 15px; border-radius: 5px; display: inline-block; min-width: 200px; color: #333;">
      <strong style="font-size: 14px;">[AI Processing]</strong>
      <div style="margin-top: 10px; text-align: left; font-size: 14px; line-height: 1.6;">
        - Task A<br/>
        - Task B<br/>
        - Task C<br/>
        - Task D
      </div>
    </div>
  </div>
  <div style="text-align: center; font-size: 24px; margin: 10px 0; color: #333;">↓</div>
  <div style="text-align: center; margin-top: 15px;">
    <div style="background-color: #50C878; color: white; padding: 10px; border-radius: 5px; display: inline-block; font-size: 14px; font-weight: bold;">
      Output (All Results at Once)
    </div>
  </div>
</div>

**Characteristics:**

- Linear, one-shot processing
- No intermediate verification
- Context dilution across tasks
- Limited depth per subtask

**Prompt Chaining Architecture**

<div style="background-color: #f5f5f5; padding: 20px; border-radius: 8px; border-left: 4px solid #666; margin: 20px 0; font-family: 'Courier New', monospace; font-size: 14px;">
  <div style="margin: 15px 0;">
    <span style="background-color: #4A90E2; color: white; padding: 8px 12px; border-radius: 5px; font-size: 14px; font-weight: bold;">Input 1</span>
    <span style="margin: 0 10px; color: #333; font-size: 16px;">→</span>
    <span style="background-color: #E8E8E8; padding: 8px 12px; border-radius: 5px; color: #333; font-size: 14px; font-weight: bold;">[AI Task A]</span>
    <span style="margin: 0 10px; color: #333; font-size: 16px;">→</span>
    <span style="background-color: #50C878; color: white; padding: 8px 12px; border-radius: 5px; font-size: 14px; font-weight: bold;">Output 1</span>
  </div>
  <div style="text-align: center; font-size: 18px; margin: 10px 0; color: #333;">↓ <span style="font-size: 12px; font-style: italic;">(becomes input)</span></div>
  <div style="margin: 15px 0; padding-left: 40px;">
    <span style="background-color: #4A90E2; color: white; padding: 8px 12px; border-radius: 5px; font-size: 14px; font-weight: bold;">Input 2</span>
    <span style="margin: 0 10px; color: #333; font-size: 16px;">→</span>
    <span style="background-color: #E8E8E8; padding: 8px 12px; border-radius: 5px; color: #333; font-size: 14px; font-weight: bold;">[AI Task B]</span>
    <span style="margin: 0 10px; color: #333; font-size: 16px;">→</span>
    <span style="background-color: #50C878; color: white; padding: 8px 12px; border-radius: 5px; font-size: 14px; font-weight: bold;">Output 2</span>
  </div>
  <div style="text-align: center; font-size: 18px; margin: 10px 0; padding-left: 40px; color: #333;">↓ <span style="font-size: 12px; font-style: italic;">(becomes input)</span></div>
  <div style="margin: 15px 0; padding-left: 80px;">
    <span style="background-color: #4A90E2; color: white; padding: 8px 12px; border-radius: 5px; font-size: 14px; font-weight: bold;">Input 3</span>
    <span style="margin: 0 10px; color: #333; font-size: 16px;">→</span>
    <span style="background-color: #E8E8E8; padding: 8px 12px; border-radius: 5px; color: #333; font-size: 14px; font-weight: bold;">[AI Task C]</span>
    <span style="margin: 0 10px; color: #333; font-size: 16px;">→</span>
    <span style="background-color: #50C878; color: white; padding: 8px 12px; border-radius: 5px; font-size: 14px; font-weight: bold;">Output 3</span>
  </div>
  <div style="text-align: center; font-size: 18px; margin: 10px 0; padding-left: 80px; color: #333;">↓</div>
  <div style="text-align: center; margin-top: 15px;">
    <div style="background-color: #FFD700; color: #333; padding: 10px 20px; border-radius: 5px; display: inline-block; font-weight: bold; font-size: 14px;">
      Final Result
    </div>
  </div>
</div>

**Characteristics:**

- Sequential, focused processing
- Verification points between stages
- Context accumulation and refinement
- Deep specialization per task
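
The "verification points between stages" idea can be sketched as a small runner that checks each output before passing it on. This is a minimal illustration, not the notebook's implementation: `run_checked_chain`, the `Stage` tuple, and `fake_llm` are hypothetical names, and `llm` is assumed to be any callable that takes a prompt string and returns text (in this course you would wrap `get_chat_completion` to fit that shape).

```python
from typing import Callable, List, Tuple

# Hypothetical stage shape: (name, prompt template, validator for the output).
Stage = Tuple[str, str, Callable[[str], bool]]


def run_checked_chain(
    stages: List[Stage], first_input: str, llm: Callable[[str], str]
) -> List[Tuple[str, str]]:
    """Run stages in order and stop at the first output that fails its check."""
    history: List[Tuple[str, str]] = []
    current = first_input
    for name, template, is_valid in stages:
        output = llm(template.format(input=current))
        if not is_valid(output):
            raise ValueError(f"Stage '{name}' failed verification")
        history.append((name, output))
        current = output  # the verified output becomes the next stage's input
    return history


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call so the sketch runs offline."""
    return prompt.upper()


stages = [
    ("outline", "Outline: {input}", lambda out: len(out) > 0),
    ("draft", "Draft from: {input}", lambda out: out.startswith("DRAFT")),
]
history = run_checked_chain(stages, "caching", fake_llm)
print(history[-1][1])
```

Raising on a failed check is one possible design; a real workflow might retry the stage or ask a human to review instead.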

**Practical Example: Code Review**

Let's see this in action with a real software engineering task.

**❌ Single Prompt Approach:**

```text
Review this authentication service code for security issues, performance problems, code quality, and generate fixes with tests.
```

**Problems:**
- AI must juggle security analysis + performance review + quality check + fix generation
- May miss subtle vulnerabilities while thinking about tests
- Fixes generated without thorough analysis
- No opportunity to validate findings before committing to solutions

**✅ Prompt Chain Approach:**

```text
Chain 1: "Analyze this authentication code for security vulnerabilities only"
         → Produces: List of security issues with severity ratings
         
Chain 2: "Review the same code for performance bottlenecks"
         → Produces: Performance analysis with suspected bottlenecks
         
Chain 3: "Evaluate code quality and maintainability"
         → Produces: Quality assessment
         
Chain 4: "Based on ALL the analyses above, generate prioritized fixes"
         → Produces: Comprehensive solutions addressing all issues
         
Chain 5: "Create tests for the fixed implementation"
         → Produces: Test suite validating the improvements
```

**Benefits:**
- Each analysis gets full attention without distraction
- You can verify findings at each stage
- Later steps have complete context from earlier work
- Higher quality, more thorough results
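
The five-step chain above could look roughly like this in code. It is a sketch under assumptions: `review_chain` and `recording_llm` are hypothetical, the prompt wording is illustrative, and `llm` stands for any prompt-in, text-out callable. The point to notice is that the fix step receives every earlier analysis, each wrapped in its own tag.

```python
from typing import Callable, Dict, List


def review_chain(code: str, llm: Callable[[str], str]) -> Dict[str, str]:
    """Three focused analyses, then fixes that see all of them, then tests."""
    focus_areas = {
        "security": "Analyze this code for security vulnerabilities only.",
        "performance": "Review this code for performance bottlenecks only.",
        "quality": "Evaluate code quality and maintainability only.",
    }
    results: Dict[str, str] = {}
    for name, instruction in focus_areas.items():
        results[name] = llm(f"{instruction}\n\n<code>\n{code}\n</code>")

    # Chain 4: pass every earlier analysis forward, one tag per analysis.
    analyses = "\n".join(f"<{n}>\n{results[n]}\n</{n}>" for n in focus_areas)
    results["fixes"] = llm(
        "Based on ALL the analyses below, generate prioritized fixes.\n\n"
        f"{analyses}\n\n<code>\n{code}\n</code>"
    )
    # Chain 5: the test step only needs the fixed implementation.
    results["tests"] = llm(
        f"Create tests for this fixed implementation:\n\n{results['fixes']}"
    )
    return results


prompts: List[str] = []


def recording_llm(prompt: str) -> str:
    """Offline stand-in that records each prompt and returns a numbered reply."""
    prompts.append(prompt)
    return f"output-{len(prompts)}"


results = review_chain("def login(user, pw): ...", recording_llm)
print(len(prompts), "calls;", "fix prompt includes security analysis:",
      "<security>" in prompts[3])
```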

---

### When to Use Each Approach

**Use Single Prompt When:**
- ✅ Task is simple and self-contained
- ✅ You need a quick, one-off response
- ✅ The request has no interdependent parts
- ✅ Speed matters more than depth
- ✅ Example: "Format this JSON" or "Explain what this function does"

**Use Prompt Chaining When:**
- 🔗 Task is complex with multiple stages
- 🔗 Quality and accuracy are critical
- 🔗 You need to verify intermediate results
- 🔗 Later steps depend on earlier outputs
- 🔗 You want to iterate and refine
- 🔗 Working with large codebases or documentation
- 🔗 Examples: Multi-file refactoring, comprehensive security audits, architecture decisions

---

### Implementation Best Practices

**How to Chain Effectively:**
1. **Identify subtasks:** Break task into distinct, sequential steps
2. **Use XML tags:** Pass outputs between prompts with structured tags like `<analysis>`, `<review>`, `<code>`
3. **Single objective:** Each step has one clear goal
4. **Iterate:** Refine problematic steps without redoing entire chain
5. **Pass context forward:** Each step receives relevant outputs from previous steps
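
As a rough illustration of steps 2 and 5, the helpers below wrap one step's output in a tag and place it ahead of the next step's instruction. The names `wrap`, `extract`, and `build_prompt` are hypothetical and not part of `setup_utils`. Unlike the notebook's later `extract_between_tags()`, this `extract` returns `None` when a tag is missing, which is a design choice made for this sketch.

```python
import re
from typing import Dict, Optional


def wrap(tag: str, content: str) -> str:
    """Wrap content in an XML-style tag so the next prompt can refer to it."""
    return f"<{tag}>\n{content.strip()}\n</{tag}>"


def extract(tag: str, text: str) -> Optional[str]:
    """Return the first <tag>...</tag> body, or None if the tag is absent."""
    pattern = rf"<{re.escape(tag)}>(.*?)</{re.escape(tag)}>"
    match = re.search(pattern, text, re.DOTALL)
    return match.group(1).strip() if match else None


def build_prompt(instruction: str, context: Dict[str, str]) -> str:
    """Put tagged context blocks first, then the single instruction."""
    blocks = "\n\n".join(wrap(tag, body) for tag, body in context.items())
    return f"{blocks}\n\n{instruction}" if blocks else instruction


# A made-up step-1 reply, used only to show the hand-off.
step1_output = (
    "Here is my review.\n"
    "<analysis>Card number is logged in plain text.</analysis>"
)
analysis = extract("analysis", step1_output)
prompt = build_prompt(
    "Fix the issues listed in the analysis.",
    {"analysis": analysis or "", "code": "def pay(card): log(card)"},
)
print(prompt)
```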

**Common Patterns from Real Workflows:**
- **Content Creation:** Research → Outline → Draft → Edit → Format
- **Code Review:** Analyze → Rate Severity → Generate Fixes → Validate
- **Document Analysis:** Extract key points → Summarize → Generate insights
- **Self-Correction:** Generate → Review own work → Improve → Verify

---

### Advanced Pattern - Self-Correction Chains

**Want to take prompt chaining further?** Create chains where AI reviews and improves its own work!

**The Analogy:** Like a manuscript going through editorial revisions:
- **Regular Chain:** Draft → Edit → Publish (one pass)
- **Self-Correcting:** Draft → Critique → Revise → Verify → Publish (feedback loops)

**The 3-Step Pattern:**
```text
1. Generate → AI creates initial solution
2. Critique → AI reviews its OWN work for issues
3. Improve → AI fixes identified problems
```
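
The generate, critique, improve pattern can be written as a bounded loop. This is a minimal sketch with hypothetical names (`self_correct`, `make_scripted_llm`) and illustrative prompt wording; `llm` is assumed to be a prompt-in, text-out callable. The `max_rounds` cap is an assumption added here to keep API cost predictable.

```python
from typing import Callable, List, Tuple


def self_correct(task: str, llm: Callable[[str], str], max_rounds: int = 2) -> str:
    """Generate a draft, then critique and revise it up to max_rounds times."""
    draft = llm(f"Task: {task}\nProvide a solution.")
    for _ in range(max_rounds):
        critique = llm(
            "Critique this solution. Reply with exactly NO ISSUES if none.\n\n"
            f"{draft}"
        )
        if critique.strip().upper() == "NO ISSUES":
            break  # nothing left to fix, stop early
        draft = llm(
            "Revise the solution to address the critique.\n\n"
            f"Solution:\n{draft}\n\nCritique:\n{critique}"
        )
    return draft


def make_scripted_llm(replies: List[str]) -> Tuple[Callable[[str], str], List[str]]:
    """Offline stand-in that returns canned replies in order and logs prompts."""
    remaining = iter(replies)
    calls: List[str] = []

    def llm(prompt: str) -> str:
        calls.append(prompt)
        return next(remaining)

    return llm, calls


llm, calls = make_scripted_llm(["v1", "missing None check", "v2", "NO ISSUES"])
final = self_correct("validate an email address", llm)
print(final, len(calls))
```

Because the critic is the same model that wrote the draft, a "NO ISSUES" reply is a signal to stop iterating, not proof that the code is correct.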

**Self-Correcting Architecture**

<div style="background-color: #f5f5f5; padding: 20px; border-radius: 8px; border-left: 4px solid #9370DB; margin: 20px 0; font-family: 'Courier New', monospace; font-size: 14px;">
  <div style="margin: 15px 0;">
    <span style="background-color: #4A90E2; color: white; padding: 8px 12px; border-radius: 5px; font-size: 14px; font-weight: bold;">Input</span>
    <span style="margin: 0 10px; color: #333; font-size: 16px;">→</span>
    <span style="background-color: #E8E8E8; padding: 8px 12px; border-radius: 5px; color: #333; font-size: 14px; font-weight: bold;">[Generate Draft]</span>
    <span style="margin: 0 10px; color: #333; font-size: 16px;">→</span>
    <span style="background-color: #FFB347; color: white; padding: 8px 12px; border-radius: 5px; font-size: 14px; font-weight: bold;">Draft v1</span>
  </div>
  <div style="text-align: center; font-size: 18px; margin: 10px 0; color: #333;">↓</div>
  <div style="margin: 15px 0; padding-left: 40px;">
    <span style="background-color: #9370DB; color: white; padding: 8px 12px; border-radius: 5px; font-size: 14px; font-weight: bold;">[Critique & Identify Issues]</span>
    <span style="margin: 0 10px; color: #333; font-size: 16px;">→</span>
    <span style="background-color: #DC143C; color: white; padding: 8px 12px; border-radius: 5px; font-size: 14px; font-weight: bold;">Issues List</span>
  </div>
  <div style="text-align: center; font-size: 18px; margin: 10px 0; padding-left: 40px; color: #333;">↓ <span style="font-size: 12px; font-style: italic;">(feedback loop)</span></div>
  <div style="margin: 15px 0; padding-left: 80px;">
    <span style="background-color: #E8E8E8; padding: 8px 12px; border-radius: 5px; color: #333; font-size: 14px; font-weight: bold;">[Revise Based on Critique]</span>
    <span style="margin: 0 10px; color: #333; font-size: 16px;">→</span>
    <span style="background-color: #32CD32; color: white; padding: 8px 12px; border-radius: 5px; font-size: 14px; font-weight: bold;">Improved v2</span>
  </div>
  <div style="text-align: center; font-size: 18px; margin: 10px 0; padding-left: 80px; color: #333;">↓</div>
  <div style="text-align: center; margin-top: 15px;">
    <div style="background-color: #FFD700; color: #333; padding: 10px 20px; border-radius: 5px; display: inline-block; font-weight: bold; font-size: 14px;">
      Final Verified Output
    </div>
  </div>
</div>

**When to Use:**
- ✅ Production code generation (high quality needed)
- ✅ Critical analysis tasks (accuracy matters)
- ✅ Automated QA workflows (no human review available)

**Trade-offs:**
- ⚠️ More API calls = Higher cost
- ⚠️ Longer execution time
- ✅ Best for quality over speed

<div style="margin:16px 0; padding:12px; background:#fef3c7; border-radius:6px; border-left:4px solid #f59e0b; color:#78350f;">
<strong>💡 Pro Tip:</strong> The example below extends the 3-step pattern with a fourth step that verifies the result and assigns a quality score. Run it to see how the AI improves its own code!
</div>

In [None]:
# Shared setup helpers (run Section 2.1 first to install dependencies)
from setup_utils import get_chat_completion


#### 📦 Helper Functions Setup

**What this cell does:**  
Defines two utility functions we'll use throughout the self-correcting chain example:
- `extract_between_tags()` - Extracts content from XML tags like `<code>`, `<issues>`, etc.
- `print_section()` - Formats output with colored headers for better readability

**Why run this first:**  
These helpers need to be loaded before the main function can use them.

**Expected output:** No output (functions are defined and ready to use)

In [None]:
# Self-Correcting Chain Example: Email Validator Code

# Helper functions for parsing XML tags and formatting output
def extract_between_tags(text: str, tag: str) -> str:
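    """Extract content between XML-style tags; return the full text if the tags are missing."""
    start_tag = f"<{tag}>"
    end_tag = f"</{tag}>"

    start = text.find(start_tag)
    if start == -1:
        return text
    start += len(start_tag)
    # Look for the closing tag only after the opening tag, so a stray
    # closing tag earlier in the response cannot produce an empty slice.
    end = text.find(end_tag, start)
    if end == -1:
        return text
    return text[start:end].strip()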

def print_section(title: str, content: str = "", color_code: str = "96"):
    """Print formatted section with color and borders."""
    print(f"\n\033[{color_code}m{'=' * 70}")
    print(f"{title}")
    print(f"{'=' * 70}\033[0m")
    if content:
        print(content)

#### 🔧 Main Self-Correcting Chain Function

**What this cell does:**  
Defines the core `self_correcting_chain()` function that implements the 4-step process:
1. Generate initial code
2. Self-review with detailed criteria
3. Self-improve based on critique
4. Final quality verification

**Why this matters:**  
This is the reusable function you can adapt for your own projects. Study the structure of each step!

**Expected output:** No output (function is defined and ready to call)

In [None]:
def self_correcting_chain(requirement: str) -> dict:
    """
    Execute a self-correcting chain with automatic quality improvement.
    
    Args:
        requirement: The task description
        
    Returns:
        Dictionary containing all stages of the process
    """
    results = {}
    
    # STEP 1: Generate Initial Solution
    print_section("STEP 1: Generate Initial Solution", "", "94")
    
    initial_messages = [{
        "role": "user",
        "content": f"""Task: {requirement}

Requirements:
- Provide complete, working code
- Include docstrings and comments
- Handle edge cases
- Use best practices

Wrap your code in <code></code> tags."""
    }]
    
    initial_response = get_chat_completion(initial_messages)
    initial_code = extract_between_tags(initial_response, "code")
    results['initial'] = initial_code
    
    print(initial_response)
    
    # STEP 2: Self-Review with Detailed Criteria
    print_section("STEP 2: AI Self-Review & Critique", "", "93")
    
    critique_messages = [{
        "role": "user",
        "content": f"""You previously wrote this code:

<code>
{initial_code}
</code>

Now, critically review YOUR OWN code using these criteria:

**Security:**
- Input validation vulnerabilities
- Injection risks
- Data exposure issues

**Correctness:**
- Edge cases (empty strings, None, special characters)
- Boundary conditions
- Logic errors

**Code Quality:**
- Error handling completeness
- Code clarity and maintainability
- Performance considerations
- Following Python best practices (PEP 8)

Provide your analysis in this format:
<issues>
[List specific issues found, or write "No critical issues found"]
</issues>

<severity>
[Rate overall: CRITICAL / MODERATE / MINOR / NONE]
</severity>

<recommendations>
[Specific improvements to make]
</recommendations>"""
    }]
    
    critique_response = get_chat_completion(critique_messages)
    issues = extract_between_tags(critique_response, "issues")
    severity = extract_between_tags(critique_response, "severity")
    recommendations = extract_between_tags(critique_response, "recommendations")
    
    results['critique'] = {
        'full_response': critique_response,
        'issues': issues,
        'severity': severity,
        'recommendations': recommendations
    }
    
    print(critique_response)
    
    # STEP 3: Self-Improvement Based on Critique
    print_section("STEP 3: AI Self-Improvement", "", "92")
    
    improve_messages = [{
        "role": "user",
        "content": f"""Your original code:
<original_code>
{initial_code}
</original_code>

Your self-review identified these issues:
<issues>
{issues}
</issues>

Severity: {severity}

Recommendations:
<recommendations>
{recommendations}
</recommendations>

Now improve your code by:
1. Fixing all identified issues
2. Implementing the recommendations
3. Adding comprehensive error handling
4. Including detailed docstrings

Provide the improved code in <improved_code></improved_code> tags.
Also explain what you changed in <changes></changes> tags."""
    }]
    
    improved_response = get_chat_completion(improve_messages)
    improved_code = extract_between_tags(improved_response, "improved_code")
    changes = extract_between_tags(improved_response, "changes")
    
    results['improved'] = {
        'code': improved_code,
        'changes': changes,
        'full_response': improved_response
    }
    
    print(improved_response)
    
    # STEP 4: Final Verification
    print_section("STEP 4: Final Quality Verification", "", "95")
    
    verify_messages = [{
        "role": "user",
        "content": f"""Review this final code and confirm all issues are resolved:

<code>
{improved_code}
</code>

Provide:
<verification_status>PASS or FAIL</verification_status>
<remaining_issues>List any remaining issues, or "None"</remaining_issues>
<quality_score>Rate 1-10</quality_score>"""
    }]
    
    verification_response = get_chat_completion(verify_messages)
    status = extract_between_tags(verification_response, "verification_status")
    remaining = extract_between_tags(verification_response, "remaining_issues")
    score = extract_between_tags(verification_response, "quality_score")
    
    results['verification'] = {
        'status': status,
        'remaining_issues': remaining,
        'quality_score': score,
        'full_response': verification_response
    }
    
    print(verification_response)
    
    return results

#### 🚀 Execute the Self-Correcting Chain

**What this cell does:**  
Runs the complete self-correcting chain demonstration with:
- **Requirement:** Create an email validator function
- **Process:** AI generates → critiques → improves → verifies
- **Summary:** Shows before/after comparison with quality scores

**⏱️ Execution time:** ~30-60 seconds (makes 4 API calls)

**What to watch for:**
- Step 1: Initial code (may have issues)
- Step 2: AI identifies its own mistakes
- Step 3: Improved code addressing all issues
- Step 4: Final quality score (often 8-10/10, but self-assigned scores vary by model and run)

**💡 Pro tip:** Compare the initial vs improved code to see the power of self-correction!

In [None]:
# Run the self-correcting chain
requirement = "Function to validate email addresses with comprehensive checks"

print("\033[96m" + "=" * 70)
print("🔄 SELF-CORRECTING CHAIN DEMONSTRATION")
print("=" * 70 + "\033[0m")
print(f"Requirement: {requirement}\n")

results = self_correcting_chain(requirement)

# Print summary
print_section("📊 EXECUTION SUMMARY", "", "96")

print(f"""
Initial Code Length:     {len(results['initial'])} characters
Issue Severity:          {results['critique']['severity']}
Improved Code Length:    {len(results['improved']['code'])} characters
Verification Status:     {results['verification']['status']}
Quality Score:           {results['verification']['quality_score']}/10

Key Changes Made:
{results['improved']['changes']}

Remaining Issues:
{results['verification']['remaining_issues']}
""")
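
If you want to act on the verification step rather than only print it, one option is to turn the model's free-text score into a number and apply a threshold. The sketch below assumes a dictionary shaped like `results['verification']` above; `parse_quality_score`, `passes_gate`, and the threshold of 8 are hypothetical choices, not part of the notebook.

```python
import re
from typing import Optional


def parse_quality_score(raw: str) -> Optional[int]:
    """Return the first standalone integer from 1 to 10 in raw, or None."""
    match = re.search(r"\b(10|[1-9])\b", raw)
    return int(match.group(1)) if match else None


def passes_gate(verification: dict, min_score: int = 8) -> bool:
    """Accept only an explicit PASS with a parseable score at or above min_score."""
    status = verification.get("status", "").strip().upper()
    score = parse_quality_score(verification.get("quality_score", ""))
    return status == "PASS" and score is not None and score >= min_score


# Hand-written examples of what a model might return.
print(passes_gate({"status": "PASS", "quality_score": "9/10"}))
print(passes_gate({"status": "PASS", "quality_score": "7"}))
print(passes_gate({"status": "FAIL", "quality_score": "10"}))
```

Requiring an exact `PASS` means a response with missing tags, where `extract_between_tags()` falls back to the whole text, fails the gate instead of slipping through.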

#### Key Takeaways: Prompt Chaining

**The Core Idea:**
- **Single Prompt** = One person doing everything (overwhelmed)
- **Prompt Chaining** = Assembly line where each station specializes (focused)
- **Self-Correcting** = Assembly line + quality inspectors at each stage (verified)

**When to Chain:**
- ✅ Complex tasks with multiple stages
- ✅ Quality matters more than speed
- ✅ Need to verify intermediate results
- ✅ Later steps depend on earlier outputs

**How to Chain:**
1. Break into sequential steps (Analyze → Fix → Test)
2. Pass outputs forward using XML tags (`<analysis>`, `<code>`)
3. One clear goal per step
4. Use specialized roles per step (security engineer → QA)

**Common Patterns:**
- **Code Review:** Analyze → Rate severity → Generate fixes → Validate
- **Self-Correction:** Generate → Critique own work → Improve → Verify
- **Security Audit:** Scan → Prioritize → Fix → Verify

<div style="margin:16px 0; padding:12px; background:#fef3c7; border-radius:6px; border-left:4px solid #f59e0b; color:#78350f;">
<strong>💡 Pro Tip:</strong> If a step performs poorly, isolate and fix just that step without redoing the entire chain.
</div>
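
One way to make that tip practical is to keep each step's output and re-run only from the step you changed. The sketch below is an assumption-laden illustration: `run_from` and the `Step` tuple are hypothetical, and the step functions are plain Python stand-ins for model calls. Steps after the changed one are re-run because they depend on its output; steps before it are reused from the cache.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical step shape: (name, function from earlier outputs to this output).
Step = Tuple[str, Callable[[Dict[str, str]], str]]


def run_from(steps: List[Step], cache: Dict[str, str], start: str) -> Dict[str, str]:
    """Re-run `start` and every later step, reusing cached earlier outputs."""
    names = [name for name, _ in steps]
    start_index = names.index(start)  # raises ValueError for an unknown step
    outputs = {name: cache[name] for name in names[:start_index]}
    for name, step_fn in steps[start_index:]:
        outputs[name] = step_fn(outputs)
    return outputs


analyze_calls: List[int] = []


def analyze(outputs: Dict[str, str]) -> str:
    """Stand-in for an expensive analysis call; counts how often it runs."""
    analyze_calls.append(1)
    return "no input validation"


steps: List[Step] = [
    ("analyze", analyze),
    ("fix", lambda out: f"patch for: {out['analyze']}"),
    ("test", lambda out: f"tests for: {out['fix']}"),
]
first_run = run_from(steps, cache={}, start="analyze")

# Improve only the "fix" step, then resume from it with the cached analysis.
steps[1] = ("fix", lambda out: f"stricter patch for: {out['analyze']}")
second_run = run_from(steps, cache=first_run, start="fix")
print(second_run["test"], "| analyze ran", len(analyze_calls), "time(s)")
```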

*Reference: [Claude Documentation - Chain Complex Prompts](https://docs.claude.com/en/docs/build-with-claude/prompt-engineering/chain-prompts)*

---

### 🎯 Try It Yourself: Prompt Chaining

**Common Misconception:** AI can handle multiple complex tasks in one prompt just as well as breaking them into steps.

**The Reality:** Chaining gives each task full attention, which usually improves quality and reduces errors on multi-part tasks.

**Your Task:** The code below has security and validation issues. The BAD prompt asks the AI to do everything at once (review + fix + tests). To compare the two approaches:

1. First, run the cell to see the BAD single-prompt approach
2. Then, uncomment the GOOD 3-step chain and run again
3. Compare the depth and quality of both approaches

The GOOD approach breaks it into:
- **Step 1:** Analyze issues only
- **Step 2:** Fix code based on analysis
- **Step 3:** Generate tests for fixed code

See how focused steps produce better results!

#### 💻 Run This Cell to Compare Approaches

**What this cell does:**  
Shows the BAD approach (single prompt doing everything at once), then you'll uncomment the GOOD approach to see the difference.

**⏱️ Execution time:** ~10 seconds for BAD, ~30 seconds for GOOD (when uncommented)

**Your task:**
1. Run as-is to see the single-prompt approach
2. Uncomment the GOOD section (remove the `#` symbols)
3. Run again to see the 3-step chain approach
4. Compare the depth and quality of both outputs

In [None]:
# Code with multiple issues
problematic_code = """
def process_payment(amount, card_number):
    if amount > 0:
        charge_card(card_number, amount)
        return "success"
"""

# ❌ BAD: Everything at once
print("=" * 70)
print("BAD: Single Prompt (Everything at Once)")
print("=" * 70)

bad_messages = [{
    "role": "user",
    "content": f"Review this code, fix all issues, and write tests:\n\n{problematic_code}"
}]

bad_response = get_chat_completion(bad_messages)
print(bad_response[:400] + "...\n")  # Show truncated output

# ✅ YOUR TURN: Uncomment to see the focused 3-step approach
#
# print("=" * 70)
# print("GOOD: Prompt Chain (3 Focused Steps)")
# print("=" * 70)
#
# # STEP 1: Analyze
# step1 = get_chat_completion([{
#     "role": "user",
#     "content": f"Analyze this code for issues:\n{problematic_code}\n\nList: security, validation, error handling."
# }])
# print("\n🔍 STEP 1 - Analysis:")
# print(step1)
#
# # STEP 2: Fix based on analysis
# step2 = get_chat_completion([{
#     "role": "user",
#     "content": f"Fix this code based on analysis:\n\nCode: {problematic_code}\n\nIssues: {step1}"
# }])
# print("\n🔧 STEP 2 - Fixed Code:")
# print(step2)
#
# # STEP 3: Generate tests
# step3 = get_chat_completion([{
#     "role": "user",
#     "content": f"Write tests for:\n{step2}\n\nInclude: happy path, edge cases, error handling."
# }])
# print("\n✅ STEP 3 - Tests:")
# print(step3)
#
# print("\n" + "=" * 70)
# print("💡 COMPARISON:")
# print("  Single: Tries everything → shallow analysis")
# print("  Chain: One focus per step → thorough results")
# print("=" * 70)

---

<div style="margin:20px 0; padding:16px 24px; background:linear-gradient(135deg, #f093fb 0%, #f5576c 100%); border-radius:10px; color:#fff; text-align:center; box-shadow:0 4px 15px rgba(240,147,251,0.3);">
  <strong style="font-size:1.05em;">🌟 Excellent progress! Deep learning requires breathing room.</strong><br>
  <span style="font-size:0.92em; opacity:0.95; margin-top:4px; display:block;">Step away for a few minutes—your brain will thank you when you tackle the final tactics.</span>
</div>

---

### 🌳 Tactic 7: Tree of Thoughts

**Explore multiple reasoning paths and select the best solution**

**Core Principle:** Tree of Thoughts (ToT) extends chain-of-thought reasoning by exploring multiple solution paths simultaneously, evaluating each approach, and selecting or combining the best ideas. Instead of following a single linear path, the AI generates multiple candidate solutions, reasons about their trade-offs, and chooses the optimal approach.

**Why Use Tree of Thoughts:**
- **Better solutions:** Exploring alternatives finds approaches you might miss with single-path thinking
- **Trade-off analysis:** Compare different solutions on multiple dimensions (speed, maintainability, scalability)
- **Reduced blind spots:** Multiple perspectives catch issues that single-path reasoning misses
- **Informed decisions:** Explicit comparison of alternatives with clear justification

**When to Use Tree of Thoughts:**
- Architecture decisions with multiple viable approaches
- Algorithm selection when performance characteristics matter
- Design pattern choices where trade-offs exist
- Complex problem-solving requiring creative alternatives
- Code optimization with competing objectives

**How to Implement:**
1. **Generate alternatives:** Ask AI to produce 2-4 different solution approaches
2. **Evaluate each path:** Assess pros/cons, complexity, performance for each option
3. **Compare explicitly:** Use structured comparison (table, scoring) across criteria
4. **Select or synthesize:** Choose the best approach or combine strengths from multiple paths
5. **Justify decision:** Explain why the selected approach is optimal for your context
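
Steps 2 to 4 can be made explicit with a small weighted scoring table. This sketch covers only the compare-and-select part. The ratings and weights below are made-up illustrative values, where in practice they might come from an evaluation prompt or from your own judgment. `weighted_total` and `select_best` are hypothetical names.

```python
from typing import Dict, Tuple

Ratings = Dict[str, int]


def weighted_total(ratings: Ratings, weights: Dict[str, int]) -> int:
    """Combine per-criterion ratings into one score using the given weights."""
    return sum(weights[criterion] * ratings[criterion] for criterion in weights)


def select_best(
    candidates: Dict[str, Ratings], weights: Dict[str, int]
) -> Tuple[str, Dict[str, int]]:
    """Score every candidate and return the winner plus all totals."""
    totals = {name: weighted_total(r, weights) for name, r in candidates.items()}
    best = max(totals, key=lambda name: totals[name])
    return best, totals


# Illustrative ratings on a 1-5 scale (higher is better); not measured data.
candidates = {
    "nested loops": {"speed": 1, "memory": 5, "readability": 4},
    "hash set": {"speed": 5, "memory": 3, "readability": 5},
    "sorting": {"speed": 4, "memory": 4, "readability": 4},
}
# Weights encode what matters in your context; here speed matters most.
weights = {"speed": 5, "memory": 2, "readability": 3}

best, totals = select_best(candidates, weights)
print(best, totals)
```

Keeping all the totals, not just the winner, supports step 5: you can show why the chosen approach scored higher and how the result would shift under different weights.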

**Real-world applications:** Choosing between microservices vs. monolith, selecting data structures for performance, evaluating caching strategies, or comparing authentication approaches all benefit from tree-of-thoughts exploration.

*Reference: [Tree of Thoughts: Deliberate Problem Solving with Large Language Models](https://arxiv.org/abs/2305.10601)*

#### Example 1: Comparing Algorithm Approaches

This example shows basic tree-of-thoughts: generate multiple solutions, evaluate trade-offs, select the best.

**Scenario:** Find duplicates in a large dataset

<div style="background: #f8fafc; padding: 25px; border-radius: 10px; box-shadow: 0 2px 8px rgba(0,0,0,0.1); margin: 20px 0;">

<div style="text-align: center; margin-bottom: 20px;">
<div style="display: inline-block; background: linear-gradient(135deg, #667eea 0%, #764ba2 100%); color: white; padding: 12px 24px; border-radius: 8px; font-weight: bold; font-size: 16px;">
🌳 Task: Find Duplicates in 1M Records
</div>
</div>

<!-- STEP 1: Generate Multiple Paths -->
<div style="background: white; padding: 20px; border-radius: 8px; margin-bottom: 15px; border-left: 5px solid #3b82f6; box-shadow: 0 1px 3px rgba(0,0,0,0.1);">
<table style="width: 100%; border: none;">
<tr>
<td style="width: 70px; vertical-align: top; text-align: center; border: none;">
<div style="width: 55px; height: 55px; background: #3b82f6; color: white; border-radius: 50%; display: inline-flex; align-items: center; justify-content: center; font-weight: bold; font-size: 22px; box-shadow: 0 2px 4px rgba(59, 130, 246, 0.3);">1</div>
</td>
<td style="padding-left: 15px; border: none;">
<div style="font-weight: 700; font-size: 17px; color: #1e40af; margin-bottom: 6px;">🌿 Generate Alternative Approaches</div>
<div style="font-size: 15px; color: #475569; line-height: 1.6;">
<strong>Prompt:</strong> "Generate 3 different approaches to find duplicates"<br>
<strong>Result:</strong> Approach A (nested loops), Approach B (hash set), Approach C (sorting)
</div>
</td>
</tr>
</table>
</div>

<!-- STEP 2: Evaluate Each Path -->
<div style="background: white; padding: 20px; border-radius: 8px; margin-bottom: 15px; border-left: 5px solid #f59e0b; box-shadow: 0 1px 3px rgba(0,0,0,0.1);">
<table style="width: 100%; border: none;">
<tr>
<td style="width: 70px; vertical-align: top; text-align: center; border: none;">
<div style="width: 55px; height: 55px; background: #f59e0b; color: white; border-radius: 50%; display: inline-flex; align-items: center; justify-content: center; font-weight: bold; font-size: 22px; box-shadow: 0 2px 4px rgba(245, 158, 11, 0.3);">2</div>
</td>
<td style="padding-left: 15px; border: none;">
<div style="font-weight: 700; font-size: 17px; color: #78350f; margin-bottom: 6px;">⚖️ Evaluate Trade-offs</div>
<div style="font-size: 15px; color: #475569; line-height: 1.6;">
<strong>Prompt:</strong> "Analyze time complexity, space complexity, readability"<br>
<strong>Result:</strong> A: O(n²) time ❌ | B: O(n) time, O(n) space ✅ | C: O(n log n) time ⚠️
</div>
</td>
</tr>
</table>
</div>

<!-- STEP 3: Select Best Path -->
<div style="background: white; padding: 20px; border-radius: 8px; margin-bottom: 15px; border-left: 5px solid #10b981; box-shadow: 0 1px 3px rgba(0,0,0,0.1);">
<table style="width: 100%; border: none;">
<tr>
<td style="width: 70px; vertical-align: top; text-align: center; border: none;">
<div style="width: 55px; height: 55px; background: #10b981; color: white; border-radius: 50%; display: inline-flex; align-items: center; justify-content: center; font-weight: bold; font-size: 22px; box-shadow: 0 2px 4px rgba(16, 185, 129, 0.3);">3</div>
</td>
<td style="padding-left: 15px; border: none;">
<div style="font-weight: 700; font-size: 17px; color: #047857; margin-bottom: 6px;">🎯 Select Optimal Solution</div>
<div style="font-size: 15px; color: #475569; line-height: 1.6;">
<strong>Decision:</strong> Choose Approach B (hash set)<br>
<strong>Rationale:</strong> Best time complexity, acceptable memory usage, clean code
</div>
</td>
</tr>
</table>
</div>

</div>

<div style="background: linear-gradient(135deg, #e0f2fe 0%, #bae6fd 100%); border-left: 4px solid #0284c7; padding: 18px; border-radius: 8px; margin: 20px 0;">
<div style="font-weight: 700; color: #0c4a6e; margin-bottom: 10px; font-size: 16px;">💡 Why This Works Better Than Single-Path</div>
<table style="width: 100%; border: none; font-size: 14px; color: #0c4a6e;">
<tr>
<td style="width: 50%; padding-right: 15px; border: none; vertical-align: top;">
<strong>❌ Single Path:</strong><br>
"Write code to find duplicates"<br><br>
→ AI picks first idea<br>
→ May not be optimal<br>
→ No comparison
</td>
<td style="width: 50%; padding-left: 15px; border: none; vertical-align: top; border-left: 2px solid #38bdf8;">
<strong>✅ Tree of Thoughts:</strong><br>
Explore 3 approaches<br><br>
→ AI generates alternatives<br>
→ Evaluates trade-offs<br>
→ Picks best solution<br>
→ You understand WHY!
</td>
</tr>
</table>
</div>
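
For reference, the three approaches named in the walkthrough can be sketched in plain Python. This is an illustrative sketch, not output from the model: the function names and the sample data are assumptions introduced here, and records are assumed to be hashable and sortable values such as email strings.

```python
from typing import Hashable, Iterable, List


def duplicates_nested(records: List[Hashable]) -> set:
    """Approach A: compare every pair of records. O(n^2) time."""
    found = set()
    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            if records[i] == records[j]:
                found.add(records[i])
    return found


def duplicates_hash_set(records: Iterable[Hashable]) -> set:
    """Approach B: remember what has been seen. O(n) time, O(n) space."""
    seen, found = set(), set()
    for record in records:
        if record in seen:
            found.add(record)
        else:
            seen.add(record)
    return found


def duplicates_sorted(records: Iterable) -> set:
    """Approach C: sort, then compare neighbours. O(n log n) time."""
    ordered = sorted(records)
    return {a for a, b in zip(ordered, ordered[1:]) if a == b}


# Hypothetical sample data for illustration only.
sample = ["ana@x.com", "bo@x.com", "ana@x.com", "cy@x.com", "bo@x.com"]
print(duplicates_hash_set(sample))
```

All three return the same set of duplicated values; they differ only in the time and memory they need, which is exactly what Step 2 asks the model to compare.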

---

**Now let's see it in action! 👇 Run the code below:**

#### 💻 Step 1: Generate Alternative Approaches

**What this step does:**  
Asks the AI to propose 3 completely different algorithms for finding duplicates in 1M records.

**Expected output:**
- Approach 1: Nested loops method
- Approach 2: Hash set method  
- Approach 3: Sorting method
- Each with time/space complexity analysis

**⏱️ Execution time:** ~8-10 seconds

**💡 Watch for:** How AI naturally proposes different data structures and approaches

In [None]:
# Tree of Thoughts Example - STEP 1: Generate Alternatives

problem = "Find duplicate records in a dataset of 1 million user records"

print("=" * 70)
print("STEP 1: Generate Alternative Approaches")
print("=" * 70)
print("Asking AI to propose 3 different solutions...")
print()

generate_alternatives = [{
    "role": "user",
    "content": f"""Problem: {problem}

Generate 3 DIFFERENT approaches to solve this problem. For each approach, provide:
1. Algorithm/data structure used
2. High-level implementation strategy
3. Time and space complexity

Use this format:
<approach_1>
Name: [descriptive name]
Algorithm: [approach]
Strategy: [how it works]
Time: O(?)
Space: O(?)
</approach_1>

Provide approaches 1, 2, and 3."""
}]

alternatives = get_chat_completion(generate_alternatives)
print(alternatives)
print()
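
Because the Step 1 prompt asks for `<approach_N>` tags, the reply can be turned into structured data before it is passed on. The sketch below is one possible way to do that with the standard `re` module. `parse_approaches` is a hypothetical helper, not part of `setup_utils`, and it assumes the model followed the requested format exactly (real replies may add Markdown or extra text).

```python
import re
from typing import Dict, List

# Matches <approach_1> ... </approach_1>, <approach_2> ... </approach_2>, and so on.
APPROACH_BLOCK = re.compile(r"<approach_(\d+)>(.*?)</approach_\1>", re.DOTALL)
# Matches one "Key: value" line for the fields requested in the prompt.
FIELD_LINE = re.compile(
    r"^[ \t]*(Name|Algorithm|Strategy|Time|Space):[ \t]*(.+?)[ \t]*$", re.MULTILINE
)


def parse_approaches(text: str) -> List[Dict[str, str]]:
    """Return one dict per <approach_N> block, keyed by lower-cased field name."""
    approaches = []
    for number, body in APPROACH_BLOCK.findall(text):
        fields = {key.lower(): value for key, value in FIELD_LINE.findall(body)}
        fields["id"] = number
        approaches.append(fields)
    return approaches


# Hand-written stand-in for a model reply, used only to exercise the parser.
example_reply = """<approach_1>
Name: Hash set
Algorithm: Set membership
Strategy: Track keys already seen
Time: O(n)
Space: O(n)
</approach_1>
<approach_2>
Name: Sorting
Algorithm: Sort and scan
Strategy: Compare adjacent records
Time: O(n log n)
Space: O(n)
</approach_2>"""

for approach in parse_approaches(example_reply):
    print(approach["id"], approach["name"], approach["time"])
```

If the list comes back shorter than expected, that is a useful signal to re-prompt before spending a call on Step 2.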

#### 💻 Step 2: Evaluate Trade-offs

**What this step does:**  
Evaluates each of the 3 approaches against specific criteria with scores 1-10.

**Evaluation criteria:**
- Performance (time complexity impact)
- Memory (space complexity)
- Scalability (handles growth)
- Code Complexity (maintainability)
- Edge Cases (robustness)

**Expected output:** Comparison table with scores for each approach

**⏱️ Execution time:** ~10-12 seconds

**💡 Watch for:** How different approaches excel in different areas (no perfect solution!)

In [None]:
# Tree of Thoughts Example - STEP 2: Evaluate Trade-offs

print("=" * 70)
print("STEP 2: Evaluate Trade-offs")
print("=" * 70)
print("Analyzing pros/cons of each approach...")
print()

evaluate_tradeoffs = [{
    "role": "user",
    "content": f"""You proposed these 3 approaches:

{alternatives}

Now evaluate each approach on these criteria:
- **Performance:** Time complexity impact on 1M records
- **Memory:** Space complexity and memory usage
- **Scalability:** How it handles growing data
- **Code Complexity:** Ease of implementation and maintenance
- **Edge Cases:** How well it handles duplicates, null values, etc.

Provide structured comparison in a table format with scores 1-10."""
}]

evaluation = get_chat_completion(evaluate_tradeoffs)
print(evaluation)
print()
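
The Step 2 cell pastes the Step 1 output into a fresh single-message conversation. An alternative is to carry the earlier turns forward as message history, so the model sees its own reply as an `assistant` turn. The sketch below only builds the message list; `extend_conversation` is a hypothetical helper, and whether `get_chat_completion` forwards `assistant`-role messages to the provider is an assumption to verify in `setup_utils`.

```python
from typing import Dict, List

Message = Dict[str, str]


def extend_conversation(history: List[Message], reply: str, next_prompt: str) -> List[Message]:
    """Return a new list: the prior turns, the model's reply, then the next user turn."""
    return history + [
        {"role": "assistant", "content": reply},
        {"role": "user", "content": next_prompt},
    ]


step1 = [{"role": "user", "content": "Generate 3 approaches to find duplicates."}]
step1_reply = "<approach_1>...</approach_1>"  # placeholder for the real model output
step2 = extend_conversation(step1, step1_reply, "Evaluate each approach on a 1-10 scale.")
print([message["role"] for message in step2])  # ['user', 'assistant', 'user']
```

Pasting the output into a new prompt keeps each step isolated and easy to inspect; carrying history keeps more context at the cost of longer inputs.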

#### 💻 Step 3: Select Best Approach

**What this step does:**  
Based on the evaluation scores, AI selects the optimal approach and provides:
- Clear justification for the choice
- Why it wins over the alternatives
- Python implementation of the selected approach

**Expected output:** 
- Selected approach name
- Detailed justification
- Python implementation of the selected approach (review and test it before production use)

**⏱️ Execution time:** ~10-12 seconds

**💡 Watch for:** The reasoning process - how AI weighs trade-offs to make the final decision

In [None]:
# Tree of Thoughts Example - STEP 3: Select Best Approach

print("=" * 70)
print("STEP 3: Select Best Approach")
print("=" * 70)
print("AI will now select the optimal solution...")
print()

select_best = [{
    "role": "user",
    "content": f"""Based on your evaluation:

{evaluation}

Select the BEST approach for finding duplicates in 1M records. Provide:

<selected_approach>[Approach name]</selected_approach>

<justification>
Why this is optimal:
- Performance advantage: [specific reason]
- Memory trade-off: [acceptable/optimal]
- Best for: [use case specifics]
- Wins on: [key criteria]
</justification>

<implementation>
[Provide Python code for the selected approach]
</implementation>"""
}]

selection = get_chat_completion(select_best)
print(selection)

# Summary
print("\n" + "=" * 70)
print("💡 KEY INSIGHT: Tree of Thoughts")
print("=" * 70)
print("""
This pattern creates better decisions through exploration:

✓ Step 1 (Generate): AI proposes 3 alternative solutions
✓ Step 2 (Evaluate): AI analyzes trade-offs objectively
✓ Step 3 (Select): AI chooses optimal approach with reasoning

Benefits:
• Avoids "first idea" bias - explores alternatives
• Explicit trade-off analysis reveals hidden costs
• Justified decisions - you understand WHY
• Better solutions through structured comparison

Use cases:
• Algorithm selection (which data structure?)
• Architecture decisions (microservices vs monolith?)
• Library choices (React vs Vue vs Svelte?)
• Design patterns (singleton vs factory vs DI?)

Why it works: Parallel exploration beats linear thinking!
""")

#### Key Takeaways: Tree of Thoughts

**The Core Idea:**
- **Single Prompt** = First solution that works (may not be optimal)
- **Prompt Chaining** = Sequential steps following one path
- **Tree of Thoughts** = Explore multiple paths → Compare → Choose best



**Why Tree of Thoughts Matters:**
Tree of Thoughts is fundamentally different from the other tactics because it explores **alternatives in parallel** rather than following a single path. Think of it as:

- **Prompt Chaining** = GPS navigation (one optimized route)
- **Tree of Thoughts** = Exploring multiple routes, comparing travel times, picking the fastest

---

#### 🔀 When to Use: Tactic 6 vs Tactic 7

<div style="background: linear-gradient(135deg, #fef3c7 0%, #fde68a 100%); padding: 20px; border-radius: 10px; margin: 20px 0; border-left: 4px solid #f59e0b;">
<div style="font-weight: 700; font-size: 18px; color: #78350f; margin-bottom: 15px;">

🤔 The Decision Framework

</div>
<table style="width: 100%; border: none; font-size: 14px; color: #78350f;">
<tr>
<td style="width: 50%; padding: 15px; background: rgba(255,255,255,0.7); border-radius: 8px; border: none; vertical-align: top;">
<strong>🔗 Use Prompt Chaining When:</strong><br><br>

* Clear, sequential process<br>
* Steps are dependent on each other<br>
* One solution approach is known<br>
* Efficiency matters (fewer API calls)<br>
* Quality improvement needed (self-correcting)<br><br>
<strong>Examples:</strong> Data pipelines, code refactoring, security audits, report generation

</td>
<td style="width: 50%; padding: 15px; background: rgba(255,255,255,0.7); border-radius: 8px; border: none; vertical-align: top;">
<strong>🌳 Use Tree of Thoughts When:</strong><br><br>

* Multiple valid approaches exist<br>
* No obvious "best" solution<br>
* Trade-offs need explicit comparison<br>
* Quality gains justify the extra cost<br>
* Creative exploration adds value<br><br>
<strong>Examples:</strong> Architecture decisions, algorithm selection, design patterns, technology choices
</td>
</tr>
</table>
</div>

#### Visual Comparison

**Prompt Chaining: Linear Path**

<div style="background-color: #f5f5f5; padding: 20px; border-radius: 8px; border-left: 4px solid #3b82f6; margin: 20px 0; font-family: 'Courier New', monospace; font-size: 14px;">
  <div style="margin: 15px 0; text-align: center;">
    <span style="background-color: #4A90E2; color: white; padding: 8px 12px; border-radius: 5px; font-size: 14px; font-weight: bold;">Input</span>
    <span style="margin: 0 10px; color: #333; font-size: 16px;">→</span>
    <span style="background-color: #E8E8E8; padding: 8px 12px; border-radius: 5px; color: #333; font-size: 14px; font-weight: bold;">Step 1</span>
    <span style="margin: 0 10px; color: #333; font-size: 16px;">→</span>
    <span style="background-color: #E8E8E8; padding: 8px 12px; border-radius: 5px; color: #333; font-size: 14px; font-weight: bold;">Step 2</span>
    <span style="margin: 0 10px; color: #333; font-size: 16px;">→</span>
    <span style="background-color: #E8E8E8; padding: 8px 12px; border-radius: 5px; color: #333; font-size: 14px; font-weight: bold;">Step 3</span>
    <span style="margin: 0 10px; color: #333; font-size: 16px;">→</span>
    <span style="background-color: #50C878; color: white; padding: 8px 12px; border-radius: 5px; font-size: 14px; font-weight: bold;">Output</span>
  </div>
  <div style="text-align: center; margin-top: 15px; font-style: italic; color: #666; font-size: 13px;">
    Following a proven recipe step-by-step
  </div>
</div>

**Tree of Thoughts: Branching Exploration**
<div style="background-color: #f5f5f5; padding: 20px; border-radius: 8px; border-left: 4px solid #10b981; margin: 20px 0; font-family: 'Courier New', monospace; font-size: 14px;">
  <div style="text-align: center; margin-bottom: 15px;">
    <div style="background-color: #4A90E2; color: white; padding: 10px; border-radius: 5px; display: inline-block; font-size: 14px; font-weight: bold;">
      Input
    </div>
  </div>
  <div style="text-align: center; font-size: 24px; margin: 10px 0; color: #333;">↓</div>
  <!-- Three parallel branches -->
  <div style="display: flex; justify-content: center; gap: 20px; margin: 15px 0;">
    <div style="text-align: center; flex: 1;">
      <div style="background-color: #93C5FD; color: #1e40af; padding: 10px; border-radius: 5px; font-size: 13px; font-weight: bold;">
        Approach A
      </div>
      <div style="font-size: 20px; margin: 8px 0; color: #333;">↓</div>
      <div style="background-color: #FEF3C7; color: #78350f; padding: 8px; border-radius: 5px; font-size: 12px;">
        Eval: 7/10
      </div>
    </div>
    <div style="text-align: center; flex: 1;">
      <div style="background-color: #93C5FD; color: #1e40af; padding: 10px; border-radius: 5px; font-size: 13px; font-weight: bold;">
        Approach B
      </div>
      <div style="font-size: 20px; margin: 8px 0; color: #333;">↓</div>
      <div style="background-color: #BBF7D0; color: #065f46; padding: 8px; border-radius: 5px; font-size: 12px; font-weight: bold;">
        Eval: 9/10 ⭐
      </div>
    </div>
    <div style="text-align: center; flex: 1;">
      <div style="background-color: #93C5FD; color: #1e40af; padding: 10px; border-radius: 5px; font-size: 13px; font-weight: bold;">
        Approach C
      </div>
      <div style="font-size: 20px; margin: 8px 0; color: #333;">↓</div>
      <div style="background-color: #FEF3C7; color: #78350f; padding: 8px; border-radius: 5px; font-size: 12px;">
        Eval: 6/10
      </div>
    </div>
  </div>
  <div style="text-align: center; font-size: 24px; margin: 10px 0; color: #10b981;">↓</div>
  <div style="text-align: center; margin: 15px 0;">
    <div style="background-color: #10B981; color: white; padding: 10px 20px; border-radius: 5px; display: inline-block; font-weight: bold; font-size: 14px;">
      Choose Best (B)
    </div>
  </div>
  <div style="text-align: center; margin-top: 15px; font-style: italic; color: #666; font-size: 13px;">
    Trying multiple recipes, picking the best
  </div>
</div>

#### Key Differences

| Aspect | 🔗 Prompt Chaining | 🌳 Tree of Thoughts |
|--------|-------------------|---------------------|
| **Path** | Single sequential | Multiple parallel |
| **Comparison** | No alternatives | Evaluates options |
| **API Calls** | 3-5 calls | 9-15+ calls |
| **Cost** | $$ | $$$ |
| **Best For** | Known process | Uncertain path |

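#### Performance & Cost Reality

**Typical Improvements (rough rules of thumb, not measured benchmarks):**

- **Prompt Chaining**: roughly 2-3x better than a single prompt
- **Tree of Thoughts**: roughly 1.2-1.5x better than chaining

**Cost Trade-off:**
- **Chaining**: 3-5 API calls (efficient)
- **ToT**: 9-15+ API calls when branches are generated and evaluated in separate calls (2-3x more expensive); the condensed 3-step walkthrough above batches all branches into each prompt and uses only 3 calls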
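
The call counts above depend on how the tree is run. The sketch below is a simple counting model, assuming one generation stage, one evaluation stage, and one final selection call; `tot_call_count` and its parameters are hypothetical names introduced here for illustration.

```python
def tot_call_count(branches: int, per_branch_generation: bool, per_branch_evaluation: bool) -> int:
    """Count API calls for one generate -> evaluate -> select pass."""
    generate = branches if per_branch_generation else 1  # one call per branch, or one batched call
    evaluate = branches if per_branch_evaluation else 1
    select = 1
    return generate + evaluate + select


print(tot_call_count(3, False, False))  # 3  (batched, like the walkthrough above)
print(tot_call_count(3, True, True))    # 7
print(tot_call_count(5, True, True))    # 11
```

Adding refinement rounds or deeper levels to the tree multiplies these numbers further, which is where the higher estimates come from.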

<div style="background: linear-gradient(135deg, #fef3c7 0%, #fde68a 100%); padding: 20px; border-radius: 10px; margin: 20px 0; border-left: 5px solid #f59e0b; box-shadow: 0 2px 8px rgba(245, 158, 11, 0.3);">
<div style="font-weight: 700; font-size: 18px; color: #78350f; margin-bottom: 12px;">
📊 The 90/10 Rule
</div>
<div style="font-size: 15px; color: #78350f; line-height: 1.7;">
Use <strong>Prompt Chaining</strong> (Tactic 6) for <strong style="background: white; padding: 2px 8px; border-radius: 4px;">90% of your tasks</strong>.<br><br>
Reserve <strong>Tree of Thoughts</strong> (Tactic 7) for the <strong style="background: white; padding: 2px 8px; border-radius: 4px;">10% where quality justifies the cost</strong>:
<ul style="margin-top: 10px; margin-bottom: 0;">
<li>Architecture decisions with long-term impact</li>
<li>Key algorithm choices affecting performance</li>
<li>Critical design patterns used throughout system</li>
<li>Technology selections (databases, frameworks)</li>
</ul>
</div>
</div>

---



#### The 3-Step Tree of Thoughts Pattern

1. **Generate alternatives** (3-4 different approaches)

   ```python
   options = [generate(input, approach=i) for i in range(3)]
   ```

2. **Evaluate explicitly** (score each on key criteria)

   ```python
   scores = [evaluate(opt, criteria=['performance', 'complexity', 'cost']) 
             for opt in options]
   ```

3. **Select with justification** (explain why X beats Y and Z)

   ```python
   best = options[scores.index(max(scores))]
   ```
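
The three snippets above are shorthand: `generate` and `evaluate` are not defined anywhere in this notebook. A runnable sketch of the same pattern, with both steps passed in as callables, might look like this. The stand-in functions and their scores are made up so the example runs offline; in practice each would wrap a model call.

```python
from typing import Callable, Dict, List, Tuple


def tree_of_thoughts(
    problem: str,
    angles: List[str],
    generate: Callable[[str, str], str],
    evaluate: Callable[[str], float],
) -> Tuple[str, Dict[str, float]]:
    """Generate one option per angle, score each, and return the top option with all scores."""
    options = [generate(problem, angle) for angle in angles]
    scores = {option: evaluate(option) for option in options}
    best = max(scores, key=scores.get)
    return best, scores


# Stand-in callables (hypothetical); real versions would call the model.
def fake_generate(problem: str, angle: str) -> str:
    return f"{angle} solution for: {problem}"


def fake_evaluate(option: str) -> float:
    illustrative_scores = {"nested loops": 4.0, "hash set": 9.0, "sorting": 7.0}
    return illustrative_scores[option.split(" solution")[0]]


best, scores = tree_of_thoughts(
    "find duplicates", ["nested loops", "hash set", "sorting"], fake_generate, fake_evaluate
)
print(best)  # hash set solution for: find duplicates
```

Returning the full score dictionary alongside the winner keeps the justification step honest: the caller can see how close the runners-up were.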

#### Evaluation Criteria Examples

For algorithm selection:

- **Performance:** Time/space complexity
- **Scalability:** Growth handling
- **Maintainability:** Code clarity
- **Resources:** Memory/CPU usage

For architecture decisions:

- **Cost:** Infrastructure expenses
- **Complexity:** Development time
- **Reliability:** Error handling
- **Future-proof:** Adaptability
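
Criteria rarely matter equally, so one way to combine per-criterion scores is a weighted average. The sketch below assumes the 1-10 scores have already been extracted from the model's evaluation; the weights, candidate names, and scores are illustrative values, not results from this notebook.

```python
from typing import Dict


def weighted_total(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted average of criterion scores; weights do not need to sum to 1."""
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    missing = set(weights) - set(scores)
    if missing:
        raise KeyError(f"missing scores for: {sorted(missing)}")
    return sum(scores[name] * weight for name, weight in weights.items()) / total_weight


# Illustrative weights for an algorithm-selection decision.
weights = {"performance": 0.4, "scalability": 0.3, "maintainability": 0.2, "resources": 0.1}
candidates = {
    "hash set": {"performance": 9, "scalability": 8, "maintainability": 9, "resources": 6},
    "sorting": {"performance": 7, "scalability": 7, "maintainability": 8, "resources": 8},
}
ranked = sorted(candidates, key=lambda name: weighted_total(candidates[name], weights), reverse=True)
print(ranked[0])  # hash set
```

Changing the weights (for example, raising `resources` on a memory-constrained system) can change the winner, which makes the trade-off explicit rather than implied.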

#### Real-World Example: Caching Strategy

| Aspect | 🔗 Chaining | 🌳 Tree of Thoughts |
|--------|-------------|---------------------|
| **Flow** | Linear: A → B → C → D | Branching: Try A, B, C → Pick best |
| **Output** | Single solution | Best of multiple options |
| **API Calls** | 4 | 5-7 |
| **Best for** | Known path, efficiency | Exploration, critical decisions |

<div style="margin:16px 0; padding:12px; background:#fef3c7; border-radius:6px; border-left:4px solid #f59e0b; color:#78350f;">
<strong>💡 Pro Tip:</strong> <br><br>
Start with Prompt Chaining for almost everything. Upgrade to Tree of Thoughts only when:<br>

• The decision has long-term impact (architecture, key libraries)<br>
• Multiple approaches seem equally valid<br>
• You need to justify your choice to stakeholders<br>
• The cost of choosing wrong is high
</div>



*Reference: [Tree of Thoughts Research](https://arxiv.org/abs/2305.10601) | [Prompting Guide - ToT](https://www.promptingguide.ai/techniques/tot)*

---

### 🎯 Try It Yourself: Tree of Thoughts

**Common Misconception:** The first solution that works is good enough.

**The Reality:** Exploring alternatives often reveals better approaches with significant advantages.

**Your Task:** Below, we compare two ways of designing a caching strategy. The BAD prompt asks for a single solution; the GOOD version explores alternatives before deciding.

1. First, run the cell to see the single-solution approach
2. Then, uncomment the GOOD section that uses tree-of-thoughts
3. Compare how exploring multiple paths leads to better decisions

The GOOD approach:
- **Step 1:** Generate 3 different caching strategies
- **Step 2:** Evaluate each on performance, complexity, and cost
- **Step 3:** Select the best fit with justification

See how structured comparison leads to informed decisions!

#### 💻 Run This Cell to Compare Decision Approaches

**What this cell does:**  
Compares single-path (first idea) vs. tree-of-thoughts (explore alternatives) for caching strategy.

**⏱️ Execution time:** ~10 seconds for single path, ~40 seconds for ToT (when uncommented)

**Your task:**
1. Run as-is to see single-solution approach (first idea wins)
2. Uncomment the ToT section (3-step exploration)
3. Run again to see structured comparison of 3 caching strategies
4. Notice how exploration reveals trade-offs you'd otherwise miss

**💡 Watch for:** How client-side vs CDN vs server-side caching have very different cost/latency trade-offs!

#### 🔁 Recap: Tree of Thoughts in Action

**What the earlier walkthrough did:**  
Explored 3 different algorithms for finding duplicates, then selected the optimal one:
- **Step 1:** Generate 3 alternative approaches (nested loops, hash set, sorting)
- **Step 2:** Evaluate each on performance, memory, scalability, complexity
- **Step 3:** Select the best approach with clear justification

**⏱️ Execution time:** ~30-40 seconds (makes 3 API calls)

**What to observe:**
- How different algorithms have different trade-offs
- The structured comparison table showing scores
- Why the selected approach is optimal for the specific use case (1M records)

**💡 Learning focus:** This is how you make informed architecture decisions in real projects - explore, compare, decide!

In [None]:
# ❌ BAD: Single solution without exploring alternatives
bad_messages = [{
    "role": "user",
    "content": "Design a caching strategy for our API to reduce database load."
}]
bad_response = get_chat_completion(bad_messages)
print("=" * 70)
print("SINGLE PATH (First idea that comes to mind):")
print("=" * 70)
print(bad_response)
print("\n")

# ✅ YOUR TURN: Use tree-of-thoughts to explore alternatives
# TODO: Uncomment and complete
# print("=" * 70)
# print("TREE OF THOUGHTS (Explore → Compare → Decide):")
# print("=" * 70)
#
# # STEP 1: Generate 3 caching strategies
# step1 = get_chat_completion([{
#     "role": "user",
#     "content": """Generate 3 DIFFERENT caching strategies for an API:
# 
# 1. Client-side caching (browser cache)
# 2. CDN caching (edge network)
# 3. Server-side caching (Redis/Memcached)
# 
# For each, describe: how it works, what it caches, TTL approach"""
# }])
# print("\n🌳 STEP 1 - Alternative Approaches:")
# print(step1)
#
# # STEP 2: Evaluate on multiple criteria
# step2 = get_chat_completion([{
#     "role": "user",
#     "content": f"""Evaluate these caching strategies:
# 
# {step1}
# 
# Compare on:
# - **Latency reduction:** How much faster?
# - **Database savings:** % of requests saved
# - **Cost:** Infrastructure/bandwidth costs
# - **Complexity:** Implementation difficulty
# - **Invalidation:** How to handle stale data
# 
# Provide scores 1-10 for each criterion."""
# }])
# print("\n⚖️ STEP 2 - Trade-off Analysis:")
# print(step2)
#
# # STEP 3: Select best approach with justification
# step3 = get_chat_completion([{
#     "role": "user",
#     "content": f"""Based on this analysis:
# 
# {step2}
# 
# Recommend the BEST caching strategy (or hybrid approach) with:
# - Clear justification
# - When to use it
# - Implementation priorities
# - Potential pitfalls"""
# }])
# print("\n🎯 STEP 3 - Optimal Solution:")
# print(step3)
#
# print("\n" + "=" * 70)
# print("💡 COMPARISON:")
# print("  Single: First idea → No comparison → May not be optimal")
# print("  ToT: Multiple paths → Explicit trade-offs → Informed decision")
# print("=" * 70)

---



### ⚖️ Tactic 8: LLM-as-Judge



**Create evaluation rubrics and self-critique loops**



**Core Principle:** One of the most powerful patterns in prompt engineering is using an AI model as a judge or critic to evaluate and improve outputs. This creates a self-improvement loop where the AI reviews, critiques, and refines work—either its own outputs or those from other sources.



**Why Use LLM-as-Judge:**

- **Quality assurance:** Catch errors, inconsistencies, and areas for improvement
- **Consistent evaluation:** Apply the same explicit criteria to every output (LLM judges carry their own biases, so spot-check their verdicts)
- **Iterative refinement:** Continuously improve outputs through multiple review cycles
- **Scalable review:** Automate code reviews, documentation checks, and quality audits



**When to Use LLM-as-Judge:**

- Code review and quality assessment
- Evaluating multiple solution approaches
- Grading or scoring responses against rubrics
- Providing constructive feedback on technical writing
- Testing and validation of AI-generated content
- Comparing different implementations



**How to Implement:**

1. **Define clear criteria:** Specify what makes a good/bad output
2. **Provide rubrics:** Give the judge specific evaluation dimensions
3. **Request structured feedback:** Ask for scores, ratings, or categorized feedback
4. **Include examples:** Show what excellent vs. poor outputs look like
5. **Iterate:** Use feedback to improve and re-evaluate
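
The first three steps can be sketched as a prompt builder plus a parser for the judge's reply. Everything here is an assumption made for illustration: the rubric wording, the helper names `build_judge_prompt` and `parse_judgment`, and the JSON reply shape. The model call itself is left out so the example runs offline.

```python
import json
from typing import Dict

# Hypothetical rubric: criterion name -> question the judge should answer.
RUBRIC = {
    "correctness": "Does the code do what the task asks, including edge cases?",
    "security": "Is input validated and are unsafe operations avoided?",
    "readability": "Are names, structure, and comments clear?",
}


def build_judge_prompt(candidate: str, rubric: Dict[str, str]) -> str:
    """Turn a rubric into an explicit, structured request for the judge."""
    criteria = "\n".join(f"- {name}: {question}" for name, question in rubric.items())
    return (
        "Score the submission on each criterion from 1 (poor) to 10 (excellent).\n"
        f"{criteria}\n\n"
        f"<submission>\n{candidate}\n</submission>\n\n"
        'Reply with JSON only, using the keys "scores" (criterion name to integer) '
        'and "feedback" (string).'
    )


def parse_judgment(raw: str, rubric: Dict[str, str]) -> Dict:
    """Extract the JSON object from the reply and check every criterion has a 1-10 score."""
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in judge reply")
    data = json.loads(raw[start:end + 1])
    scores = data.get("scores")
    if not isinstance(scores, dict):
        raise ValueError("judge reply has no 'scores' object")
    for name in rubric:
        value = scores.get(name)
        if not isinstance(value, (int, float)) or not 1 <= value <= 10:
            raise ValueError(f"missing or out-of-range score for {name!r}")
    return data


# Hand-written stand-in for a judge reply.
reply = 'Review:\n{"scores": {"correctness": 8, "security": 5, "readability": 9}, "feedback": "Validate input."}'
print(parse_judgment(reply, RUBRIC)["scores"])
```

Validating the reply before using it matters because a judge that skips a criterion or returns prose instead of JSON would otherwise fail silently further down the workflow.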



Automated code reviews, architecture decision validation, test coverage assessment, documentation quality checks, and comparing multiple implementation approaches all benefit from LLM-as-Judge.



*Reference: This technique combines elements from evaluation frameworks and self-critique patterns used in production AI systems.*


#### 📚 Preview: Three LLM-as-Judge Patterns

Before diving into examples, here's a quick overview of the three ways to use LLM-as-Judge:

| Pattern | What It Judges | Typical Question or Flow |
|---------|----------------|--------------------------|
| 📋 **Pattern 1: Single** | Judge one output | "Is this code good enough?" |
| ⚖️ **Pattern 2: Compare** | Judge two options | "Which approach is better?" |
| 🔄 **Pattern 3: Self-Improve** | AI judges its own work | "Generate → Critique → Fix" |

---

#### 🎯 Pattern Implementation Guide: What, Why, and How

Understanding when and how to use each pattern is key to effective LLM-as-Judge implementation.

<div style="background: white; padding: 25px; border-radius: 10px; box-shadow: 0 2px 8px rgba(0,0,0,0.1); margin: 20px 0;">

<!-- Pattern 1: Single Evaluation -->
<div style="background: linear-gradient(135deg, #dbeafe 0%, #bfdbfe 100%); padding: 20px; border-radius: 10px; border-left: 5px solid #3b82f6; margin-bottom: 20px;">

<div style="display: flex; align-items: center; gap: 15px; margin-bottom: 15px;">
<div style="width: 50px; height: 50px; background: #3b82f6; color: white; border-radius: 50%; display: inline-flex; align-items: center; justify-content: center; font-weight: bold; font-size: 24px; flex-shrink: 0;">1</div>
<div>
<div style="font-weight: 700; font-size: 18px; color: #1e40af;">📋 Single Evaluation</div>
<div style="font-size: 14px; color: #1e40af; font-style: italic;">"Is this good enough?"</div>
</div>
</div>


<div style="font-weight: 700; font-size: 16px; color: #1e40af; margin: 15px 0 10px 0;">📊 Visual Flow:</div>

<div style="background-color: #f5f5f5; padding: 20px; border-radius: 8px; border-left: 4px solid #3b82f6; margin: 20px 0; font-family: 'Courier New', monospace; font-size: 14px;">
  <div style="margin: 15px 0; text-align: center;">
    <span style="background-color: #4A90E2; color: white; padding: 8px 12px; border-radius: 5px; font-size: 14px; font-weight: bold;">INPUT</span>
    <span style="margin: 0 10px; color: #333; font-size: 16px;">→</span>
    <span style="background-color: #9333ea; color: white; padding: 8px 12px; border-radius: 5px; font-size: 14px; font-weight: bold;">[JUDGE]</span>
    <span style="margin: 0 10px; color: #333; font-size: 16px;">→</span>
    <span style="background-color: #F59E0B; color: white; padding: 8px 12px; border-radius: 5px; font-size: 14px; font-weight: bold;">SCORE</span>
    <span style="margin: 0 10px; color: #333; font-size: 16px;">→</span>
    <span style="background-color: #10B981; color: white; padding: 8px 12px; border-radius: 5px; font-size: 14px; font-weight: bold;">DECISION</span>
  </div>
</div>

<div style="background: white; padding: 15px; border-radius: 8px; margin: 10px 0;">
<div style="font-family: 'Courier New', monospace; font-size: 13px; color: #1e293b;">
<strong>Input:</strong> One output to evaluate<br>
<strong>Judge asks:</strong> Does it meet our standards?<br>
<strong>Output:</strong> Score + Pass/Fail + Feedback
</div>
</div>

<table style="width: 100%; margin-top: 10px; border: none;">
<tr>
<td style="width: 50%; padding: 10px; background: rgba(255,255,255,0.5); border-radius: 6px; border: none; font-size: 13px; color: #1e40af;">
<strong>✅ Best for:</strong><br>
- PR quality gates<br>
- Documentation reviews<br>
- Production readiness checks
</td>
<td style="width: 50%; padding: 10px; background: rgba(255,255,255,0.5); border-radius: 6px; border: none; font-size: 13px; color: #1e40af;">
<strong>💡 Example:</strong><br>
Code → Judge scores security, performance, style → Approve or reject merge
</td>
</tr>
</table>

</div>

</div>
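
The "Score → Decision" step of Pattern 1 can be sketched as a small gate function. This is a minimal sketch that assumes the judge's scores have already been parsed into a dictionary; the name `quality_gate` and both thresholds are illustrative choices, not values prescribed by this course.

```python
from typing import Dict, Tuple


def quality_gate(
    scores: Dict[str, float], min_average: float = 7.0, min_each: float = 5.0
) -> Tuple[bool, str]:
    """Pass only if the average is high enough and no single criterion falls below the floor."""
    if not scores:
        return False, "no scores returned"
    weakest = min(scores, key=scores.get)
    if scores[weakest] < min_each:
        return False, f"{weakest} scored {scores[weakest]}, below the floor of {min_each}"
    average = sum(scores.values()) / len(scores)
    if average < min_average:
        return False, f"average {average:.1f} is below {min_average}"
    return True, f"average {average:.1f}, weakest criterion: {weakest}"


# Illustrative scores: a strong average cannot hide a weak security score.
approved, reason = quality_gate({"security": 4, "performance": 9, "style": 9})
print(approved, reason)
```

The per-criterion floor is the design choice worth noting: averaging alone would let strong style and performance scores mask a security problem.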

<div style="background: white; padding: 25px; border-radius: 10px; box-shadow: 0 2px 8px rgba(0,0,0,0.1); margin: 20px 0;">

<!-- Pattern 2: Comparative -->
<div style="background: linear-gradient(135deg, #fef3c7 0%, #fde68a 100%); padding: 20px; border-radius: 10px; border-left: 5px solid #f59e0b; margin-bottom: 20px;">

<div style="display: flex; align-items: center; gap: 15px; margin-bottom: 15px;">
<div style="width: 50px; height: 50px; background: #f59e0b; color: white; border-radius: 50%; display: inline-flex; align-items: center; justify-content: center; font-weight: bold; font-size: 24px; flex-shrink: 0;">2</div>
<div>
<div style="font-weight: 700; font-size: 18px; color: #78350f;">⚖️ Comparative Evaluation</div>
<div style="font-size: 14px; color: #78350f; font-style: italic;">"Which option is better?"</div>
</div>
</div>


<div style="font-weight: 700; font-size: 16px; color: #78350f; margin: 15px 0 10px 0;">📊 Visual Flow:</div>

<div style="background-color: #f5f5f5; padding: 20px; border-radius: 8px; border-left: 4px solid #f59e0b; margin: 20px 0; font-family: 'Courier New', monospace; font-size: 14px;">
  <div style="margin: 15px 0; text-align: center;">
    <span style="background-color: #4A90E2; color: white; padding: 8px 12px; border-radius: 5px; font-size: 14px; font-weight: bold;">INPUT_A</span>
    <span style="margin: 0 5px; color: #F59E0B; font-size: 16px; font-weight: bold;">vs</span>
    <span style="background-color: #4A90E2; color: white; padding: 8px 12px; border-radius: 5px; font-size: 14px; font-weight: bold;">INPUT_B</span>
    <span style="margin: 0 10px; color: #333; font-size: 16px;">→</span>
    <span style="background-color: #9333ea; color: white; padding: 8px 12px; border-radius: 5px; font-size: 14px; font-weight: bold;">[JUDGE]</span>
    <span style="margin: 0 10px; color: #333; font-size: 16px;">→</span>
    <span style="background-color: #10B981; color: white; padding: 8px 12px; border-radius: 5px; font-size: 14px; font-weight: bold;">WINNER</span>
  </div>
</div>

<div style="background: white; padding: 15px; border-radius: 8px; margin: 10px 0;">
<div style="font-family: 'Courier New', monospace; font-size: 13px; color: #1e293b;">
<strong>Input:</strong> Option A vs Option B<br>
<strong>Judge asks:</strong> Which wins on each criterion?<br>
<strong>Output:</strong> Scores for both + Winner + Justification
</div>
</div>

<table style="width: 100%; margin-top: 10px; border: none;">
<tr>
<td style="width: 50%; padding: 10px; background: rgba(255,255,255,0.5); border-radius: 6px; border: none; font-size: 13px; color: #78350f;">
<strong>✅ Best for:</strong><br>
- Architecture decisions<br>
- Library/framework selection<br>
- Design pattern choices
</td>
<td style="width: 50%; padding: 10px; background: rgba(255,255,255,0.5); border-radius: 6px; border: none; font-size: 13px; color: #78350f;">
<strong>💡 Example:</strong><br>
REST vs GraphQL → Judge compares performance, complexity, scalability → Recommend winner
</td>
</tr>
</table>

</div>

</div>
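
LLM judges can be sensitive to the order in which options are presented, so a common safeguard for Pattern 2 is to ask twice with the order swapped and accept a winner only if both verdicts agree. The sketch below assumes a `judge(first, second)` callable that returns the string `"first"` or `"second"`; that interface and the stand-in judges are hypothetical.

```python
from typing import Callable


def compare_with_swap(option_a: str, option_b: str, judge: Callable[[str, str], str]) -> str:
    """Return 'A', 'B', or 'tie' after judging both presentation orders."""
    first_pass = judge(option_a, option_b)   # A is shown first
    second_pass = judge(option_b, option_a)  # B is shown first
    for verdict in (first_pass, second_pass):
        if verdict not in ("first", "second"):
            raise ValueError(f"unexpected verdict: {verdict!r}")
    a_wins = (first_pass == "first") + (second_pass == "second")
    if a_wins == 2:
        return "A"
    if a_wins == 0:
        return "B"
    return "tie"  # the verdict flipped with the order, so treat it as inconclusive


# Stand-in judges so the sketch runs offline.
def prefers_shorter(first: str, second: str) -> str:
    return "first" if len(first) <= len(second) else "second"


def always_first(first: str, second: str) -> str:
    return "first"  # simulates a judge with pure position bias


print(compare_with_swap("REST", "GraphQL", prefers_shorter))  # A
print(compare_with_swap("REST", "GraphQL", always_first))     # tie
```

This doubles the number of judge calls, so it is best reserved for comparisons where the decision matters.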

<div style="background: white; padding: 25px; border-radius: 10px; box-shadow: 0 2px 8px rgba(0,0,0,0.1); margin: 20px 0;">

<!-- Pattern 3: Self-Improvement -->
<div style="background: linear-gradient(135deg, #d1fae5 0%, #a7f3d0 100%); padding: 20px; border-radius: 10px; border-left: 5px solid #10b981; margin-bottom: 20px;">

<div style="display: flex; align-items: center; gap: 15px; margin-bottom: 15px;">
<div style="width: 50px; height: 50px; background: #10b981; color: white; border-radius: 50%; display: inline-flex; align-items: center; justify-content: center; font-weight: bold; font-size: 24px; flex-shrink: 0;">3</div>
<div>
<div style="font-weight: 700; font-size: 18px; color: #065f46;">🔄 Self-Improvement Loop</div>
<div style="font-size: 14px; color: #065f46; font-style: italic;">"Can it improve itself automatically?"</div>
</div>
</div>


<div style="font-weight: 700; font-size: 16px; color: #065f46; margin: 15px 0 10px 0;">📊 Visual Flow:</div>

<div style="background-color: #f5f5f5; padding: 20px; border-radius: 8px; border-left: 4px solid #10b981; margin: 20px 0; font-family: 'Courier New', monospace; font-size: 14px;">
  <div style="margin: 15px 0; text-align: center;">
    <span style="background-color: #4A90E2; color: white; padding: 8px 12px; border-radius: 5px; font-size: 14px; font-weight: bold;">[GENERATE]</span>
    <span style="margin: 0 10px; color: #333; font-size: 16px;">→</span>
    <span style="background-color: #9333ea; color: white; padding: 8px 12px; border-radius: 5px; font-size: 14px; font-weight: bold;">[JUDGE]</span>
    <span style="margin: 0 10px; color: #333; font-size: 16px;">→</span>
    <span style="background-color: #F59E0B; color: white; padding: 8px 12px; border-radius: 5px; font-size: 14px; font-weight: bold;">[IMPROVE]</span>
    <span style="margin: 0 10px; color: #333; font-size: 16px;">→</span>
    <span style="background-color: #10B981; color: white; padding: 8px 12px; border-radius: 5px; font-size: 14px; font-weight: bold;">OUTPUT</span>
  </div>
</div>

<div style="background: white; padding: 15px; border-radius: 8px; margin: 10px 0;">
<div style="font-family: 'Courier New', monospace; font-size: 13px; color: #1e293b;">
<strong>Step 1:</strong> Generate initial solution<br>
<strong>Step 2:</strong> AI critiques its OWN work<br>
<strong>Step 3:</strong> AI fixes identified issues<br>
<strong>Step 4:</strong> Verify improvements (optional)
</div>
</div>

<table style="width: 100%; margin-top: 10px; border: none;">
<tr>
<td style="width: 50%; padding: 10px; background: rgba(255,255,255,0.5); border-radius: 6px; border: none; font-size: 13px; color: #065f46;">
<strong>✅ Best for:</strong><br>
- Automated code generation<br>
- Security hardening<br>
- Quality-critical workflows
</td>
<td style="width: 50%; padding: 10px; background: rgba(255,255,255,0.5); border-radius: 6px; border: none; font-size: 13px; color: #065f46;">
<strong>💡 Example:</strong><br>
Generate validator → Critique security → Fix issues → Verify → Production-ready code
</td>
</tr>
</table>

</div>

</div>
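
The Generate → Judge → Improve flow can be sketched as a bounded loop. This is a minimal sketch under stated assumptions: `generate`, `critique`, and `improve` are hypothetical callables that would each wrap a model call, `critique` is assumed to return a `(score, feedback)` pair, and the target score and round limit are illustrative defaults.

```python
from typing import Callable, List, Tuple


def self_improve(
    task: str,
    generate: Callable[[str], str],
    critique: Callable[[str], Tuple[float, str]],
    improve: Callable[[str, str], str],
    target_score: float = 8.0,
    max_rounds: int = 3,
) -> Tuple[str, List[float]]:
    """Critique and revise until the target score is reached or rounds run out."""
    draft = generate(task)
    best_draft, best_score = draft, float("-inf")
    history: List[float] = []
    for round_number in range(max_rounds):
        score, feedback = critique(draft)
        history.append(score)
        if score > best_score:
            best_draft, best_score = draft, score
        # Stop when good enough, or when no round is left to judge another revision.
        if score >= target_score or round_number == max_rounds - 1:
            break
        draft = improve(draft, feedback)
    return best_draft, history


# Stand-in callables so the loop runs offline: each revision bumps a version number.
fake_scores = {"v1": 5.0, "v2": 7.0, "v3": 9.0}
final, history = self_improve(
    "write an email validator",
    generate=lambda task: "v1",
    critique=lambda draft: (fake_scores[draft], "tighten the regex"),
    improve=lambda draft, feedback: f"v{int(draft[1:]) + 1}",
)
print(final, history)  # v3 [5.0, 7.0, 9.0]
```

Two safeguards are built in: the round limit caps cost, and returning the best-scoring draft protects against a revision that makes things worse.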

---

**Ready to try Pattern 3? Let's see self-improvement in action! 👇**
<div style="background: #f0f9ff; border-left: 4px solid #3b82f6; padding: 15px; border-radius: 8px; margin: 20px 0;">
<div style="font-weight: 700; color: #1e40af; margin-bottom: 8px;">📖 What We'll Cover</div>
<div style="font-size: 14px; color: #1e40af; line-height: 1.8;">
<strong>Example 1:</strong> Basic self-improvement loop (simple critique)<br>
<strong>Example 2:</strong> Advanced security-focused critique (weighted scoring)<br><br>
<em>Focus: Pattern 3 (Self-Improvement) - the most powerful for automated workflows</em>
</div>
</div>

<div style="background: #fef3c7; border-left: 4px solid #f59e0b; padding: 12px; border-radius: 8px; margin: 20px 0;">
<div style="font-size: 13px; color: #78350f;">
<strong>💡 Learning Strategy:</strong> Master Pattern 3 first through hands-on examples. Once you understand self-improvement loops, Patterns 1 & 2 become intuitive variations. Full reference guide at the end!
</div>
</div>

---

#### Example 1: Basic Self-Improvement Loop



This example shows the fundamental pattern: **Generate → Judge → Improve**



<div style="background: #f8fafc; padding: 25px; border-radius: 10px; box-shadow: 0 2px 8px rgba(0,0,0,0.1); margin: 20px 0;">



<div style="text-align: center; margin-bottom: 20px;">

<div style="display: inline-block; background: linear-gradient(135deg, #667eea 0%, #764ba2 100%); color: white; padding: 12px 24px; border-radius: 8px; font-weight: bold; font-size: 16px;">

📝 Task: Create Email Validator

</div>

</div>



<!-- STEP 1 -->

<div style="background: white; padding: 20px; border-radius: 8px; margin-bottom: 15px; border-left: 5px solid #3b82f6; box-shadow: 0 1px 3px rgba(0,0,0,0.1);">

<table style="width: 100%; border: none;">

<tr>

<td style="width: 70px; vertical-align: top; text-align: center; border: none;">

<div style="width: 55px; height: 55px; background: #3b82f6; color: white; border-radius: 50%; display: inline-flex; align-items: center; justify-content: center; font-weight: bold; font-size: 22px; box-shadow: 0 2px 4px rgba(59, 130, 246, 0.3);">1</div>

</td>

<td style="padding-left: 15px; border: none;">

<div style="font-weight: 700; font-size: 17px; color: #1e40af; margin-bottom: 6px;">🤖 AI: Code Creator</div>

<div style="font-size: 15px; color: #475569; line-height: 1.6;">

<strong>What it does:</strong> Generates working email validator<br>

<strong>Result:</strong> Functional code (but may have issues)

</div>

</td>

</tr>

</table>

</div>



<!-- Arrow -->

<div style="text-align: center; margin: 15px 0;">

<div style="font-size: 28px; color: #94a3b8; font-weight: bold;">↓</div>

<div style="font-size: 13px; color: #64748b; font-weight: 600;">Passes code to Step 2</div>

</div>



<!-- STEP 2 -->

<div style="background: white; padding: 20px; border-radius: 8px; margin-bottom: 15px; border-left: 5px solid #9333ea; box-shadow: 0 1px 3px rgba(0,0,0,0.1);">

<table style="width: 100%; border: none;">

<tr>

<td style="width: 70px; vertical-align: top; text-align: center; border: none;">

<div style="width: 55px; height: 55px; background: #9333ea; color: white; border-radius: 50%; display: inline-flex; align-items: center; justify-content: center; font-weight: bold; font-size: 22px; box-shadow: 0 2px 4px rgba(147, 51, 234, 0.3);">2</div>

</td>

<td style="padding-left: 15px; border: none;">

<div style="font-weight: 700; font-size: 17px; color: #7c3aed; margin-bottom: 6px;">👨‍⚖️ AI: Critical Judge (LLM-as-Judge!)</div>

<div style="font-size: 15px; color: #475569; line-height: 1.6;">

<strong>What it does:</strong> Reviews the code IT JUST WROTE<br>

<strong>Looks for:</strong> Security flaws, edge cases, quality issues<br>

<strong>Result:</strong> List of problems found

</div>

</td>

</tr>

</table>

</div>



<!-- Arrow -->

<div style="text-align: center; margin: 15px 0;">

<div style="font-size: 28px; color: #94a3b8; font-weight: bold;">↓</div>

<div style="font-size: 13px; color: #64748b; font-weight: 600;">Feedback loop: Problems → Step 3</div>

</div>



<!-- STEP 3 -->

<div style="background: white; padding: 20px; border-radius: 8px; margin-bottom: 15px; border-left: 5px solid #10b981; box-shadow: 0 1px 3px rgba(0,0,0,0.1);">

<table style="width: 100%; border: none;">

<tr>

<td style="width: 70px; vertical-align: top; text-align: center; border: none;">

<div style="width: 55px; height: 55px; background: #10b981; color: white; border-radius: 50%; display: inline-flex; align-items: center; justify-content: center; font-weight: bold; font-size: 22px; box-shadow: 0 2px 4px rgba(16, 185, 129, 0.3);">3</div>

</td>

<td style="padding-left: 15px; border: none;">

<div style="font-weight: 700; font-size: 17px; color: #047857; margin-bottom: 6px;">🔧 AI: Code Improver</div>

<div style="font-size: 15px; color: #475569; line-height: 1.6;">

<strong>What it does:</strong> Fixes all identified problems<br>

<strong>Input:</strong> Original code + critique feedback<br>

<strong>Result:</strong> Improved, robust validator

</div>

</td>

</tr>

</table>

</div>



<!-- Arrow to Result -->

<div style="text-align: center; margin: 15px 0;">

<div style="font-size: 28px; color: #10b981; font-weight: bold;">↓</div>

</div>



<!-- FINAL RESULT -->

<div style="text-align: center;">

<div style="display: inline-block; background: linear-gradient(135deg, #fbbf24 0%, #f59e0b 100%); color: #78350f; padding: 15px 30px; border-radius: 10px; font-weight: bold; font-size: 17px; box-shadow: 0 4px 6px rgba(251, 191, 36, 0.3);">

✨ High-Quality Code (Self-Corrected!)

</div>

</div>



</div>



<!-- Key Insight Box -->

<div style="background: linear-gradient(135deg, #e0f2fe 0%, #bae6fd 100%); border-left: 4px solid #0284c7; padding: 18px; border-radius: 8px; margin: 20px 0;">

<div style="font-weight: 700; color: #0c4a6e; margin-bottom: 10px; font-size: 16px;">💡 Why This Works</div>

<table style="width: 100%; border: none; font-size: 14px; color: #0c4a6e;">

<tr>

<td style="width: 50%; padding-right: 15px; border: none; vertical-align: top;">

<strong>❌ Single Prompt:</strong><br>

"Create perfect code"<br><br>

→ AI tries to do everything<br>

→ May miss issues<br>

→ No quality check

</td>

<td style="width: 50%; padding-left: 15px; border: none; vertical-align: top; border-left: 2px solid #38bdf8;">

<strong>✅ Self-Improvement Loop:</strong><br>

3 separate, focused tasks<br><br>

→ AI creates (Step 1)<br>

→ AI critiques (Step 2)<br>

→ AI improves (Step 3)<br>

→ Issues caught before the final output

</td>

</tr>

</table>

</div>



<!-- Watch For Box -->

<div style="background: #fef3c7; border-left: 4px solid #f59e0b; padding: 15px; border-radius: 8px; margin: 20px 0;">

<div style="font-size: 14px; color: #78350f; line-height: 1.7;">

<strong>🔍 Watch for this:</strong> When you run the code below, notice how Step 2 finds issues that weren't obvious in Step 1. This is the power of separating creation from critique!

</div>

</div>



---



**Now let's see it in action! 👇 Run the code below:**

#### 💻 Run Example 1: Basic Self-Improvement Loop

**What this cell does:**  
Demonstrates the fundamental 3-step pattern with a simple email validator:
- **Step 1:** AI generates initial email validation code
- **Step 2:** AI reviews its own code for issues
- **Step 3:** AI improves based on self-critique

**⏱️ Execution time:** ~20-30 seconds (makes 3 API calls)

**What to observe:**
- How the AI finds issues in its own initial code
- The types of problems identified (edge cases, validation gaps)
- How the improved version addresses the concerns raised in the critique

**💡 Learning focus:** Notice the separation of creation vs. critique - this is the key to quality!

In [None]:
# Example 1: Self-correction with LLM-as-Judge

# This example demonstrates how AI can review and improve its own code through a 3-step process:
# Step 1: Generate initial solution
# Step 2: Self-review (AI acts as judge of its own work)
# Step 3: Self-improve based on identified issues

requirement = "Function to validate email addresses"

# STEP 1: Generate initial solution
# Why: Get a first draft that AI can later critique
# What to expect: Working code, but possibly with edge cases or quality issues
print("="*70)
print("STEP 1: Generate Initial Solution")
print("="*70)
print("Asking AI to create an email validator...")
print()

initial_messages = [{
    "role": "user",
    "content": f"{requirement}\n\nProvide code in <code> tags."
}]

initial = get_chat_completion(initial_messages)
print(initial)
print()

# STEP 2: Self-review (AI critiques its OWN work)
# Why: AI can spot issues in code better when focused solely on critique
# What to look for: Security flaws, missing edge cases, validation gaps
print("="*70)
print("STEP 2: AI Reviews Its Own Work (Acts as Judge)")
print("="*70)
print("Now AI will critique the code IT JUST WROTE...")
print("Looking for: Security issues, edge cases, validation problems")
print()

critique_messages = [{
    "role": "user",
    "content": f"""Review YOUR code for issues:

{initial}

Check for:
- Security vulnerabilities (e.g., regex denial of service)
- Edge cases not handled (e.g., empty strings, None, international emails)
- Missing validation (e.g., length limits, special characters)
- Code quality issues (e.g., missing error messages, unclear logic)

Identify problems in <issues> tags. If none found, say "No issues found"."""
}]

critique = get_chat_completion(critique_messages)
print(critique)
print()

# STEP 3: Self-improve based on own critique
# Why: AI now knows exactly what to fix from its own analysis
# What to expect: More robust code addressing all identified issues
print("="*70)
print("STEP 3: AI Improves Based on Self-Review")
print("="*70)
print("AI will now fix the issues it identified...")
print()

improve_messages = [{
    "role": "user",
    "content": f"""Your original code:
{initial}

Your self-review identified:
{critique}

If issues were found, provide improved code in <improved_code> tags.
If no issues, return the original code.
Only change what's necessary to address the identified problems."""
}]

improved = get_chat_completion(improve_messages)
print(improved)

# Summary explanation
print("\n" + "="*70)
print("💡 KEY INSIGHT: Self-Correction with LLM-as-Judge")
print("="*70)
print("""
This 3-step pattern creates automated quality improvement:

✓ Step 1 (Generate): AI creates initial solution quickly
✓ Step 2 (Judge): AI switches to critic mode and looks for issues
✓ Step 3 (Improve): AI fixes identified problems systematically

Benefits:
• Catches many mistakes automatically, before human review
• Each step has single focus (create vs. critique vs. fix)
• Quality improves through self-reflection
• Perfect for automated workflows and CI/CD pipelines

Use cases:
• Automated code generation in IDEs
• PR quality checks before human review
• Documentation generation with quality gates
• Test generation with self-validation
""")

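The prompts in this cell ask the model to wrap its output in tags such as `<code>`, `<issues>`, and `<improved_code>`, but the cell passes each full response along unparsed. Here is a minimal sketch of pulling out only the tagged part, assuming the model actually closes its tags. `extract_tag` is a hypothetical helper for illustration and is not part of `setup_utils`.

```python
import re

def extract_tag(text, tag):
    """Return the content of the first <tag>...</tag> pair, or None if absent.

    Hypothetical helper for illustration (not part of setup_utils).
    """
    pattern = rf"<{re.escape(tag)}>(.*?)</{re.escape(tag)}>"
    match = re.search(pattern, text, flags=re.DOTALL)
    return match.group(1).strip() if match else None

# Made-up response text standing in for real model output
sample = "Here you go:\n<code>\ndef is_valid(email):\n    return '@' in email\n</code>\nDone."
print(extract_tag(sample, "code"))
print(extract_tag(sample, "issues"))  # None, because the tag is not present
```

In the notebook you could try `extract_tag(initial, "code")` before building the critique prompt, so the judge sees only the code and not the surrounding explanation.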
#### Example 2: Security-Focused Critique with Weighted Scoring



Now let's level up! This example adds **weighted evaluation criteria** for security-critical code.



<div style="background: linear-gradient(135deg, #fee2e2 0%, #fecaca 100%); padding: 20px; border-radius: 10px; margin: 20px 0; border-left: 4px solid #dc2626;">

<div style="text-align: center; font-weight: 700; font-size: 17px; color: #7f1d1d; margin-bottom: 15px;">

🔒 What's Different in Example 2?

</div>



<table style="width: 100%; border: none; font-size: 14px;">

<tr style="border-bottom: 2px solid #dc2626;">

<th style="padding: 12px; text-align: left; color: #7f1d1d; border: none; font-weight: 700;">Aspect</th>

<th style="padding: 12px; text-align: center; color: #7f1d1d; border: none; font-weight: 700;">Example 1: Basic</th>

<th style="padding: 12px; text-align: center; color: #7f1d1d; border: none; font-weight: 700;">Example 2: Advanced</th>

</tr>

<tr style="background: white;">

<td style="padding: 12px; color: #991b1b; border: none; font-weight: 600;">Critique Style</td>

<td style="padding: 12px; text-align: center; color: #44403c; border: none;">Simple review<br>"Are there issues?"</td>

<td style="padding: 12px; text-align: center; color: #44403c; border: none;">Weighted scoring<br>"Rate each criterion"</td>

</tr>

<tr style="background: #fef2f2;">

<td style="padding: 12px; color: #991b1b; border: none; font-weight: 600;">Evaluation Criteria</td>

<td style="padding: 12px; text-align: center; color: #44403c; border: none;">General checks</td>

<td style="padding: 12px; text-align: center; color: #44403c; border: none;"><strong>Security 60%</strong><br>Best Practices 25%<br>Error Handling 15%</td>

</tr>

<tr style="background: white;">

<td style="padding: 12px; color: #991b1b; border: none; font-weight: 600;">Severity Ratings</td>

<td style="padding: 12px; text-align: center; color: #44403c; border: none;">❌ No</td>

<td style="padding: 12px; text-align: center; color: #44403c; border: none;">✅ CRITICAL/HIGH/MEDIUM/LOW</td>

</tr>

<tr style="background: #fef2f2;">

<td style="padding: 12px; color: #991b1b; border: none; font-weight: 600;">Best For</td>

<td style="padding: 12px; text-align: center; color: #44403c; border: none;">Learning the pattern<br>General code</td>

<td style="padding: 12px; text-align: center; color: #44403c; border: none;">Production systems<br>Security-critical code</td>

</tr>

</table>

</div>



<div style="background: #fffbeb; border-left: 4px solid #f59e0b; padding: 16px; border-radius: 8px; margin: 20px 0;">

<div style="font-weight: 700; color: #78350f; margin-bottom: 8px; font-size: 15px;">💡 Think of it like a health checkup:</div>

<table style="width: 100%; border: none; font-size: 14px; color: #78350f;">

<tr>

<td style="width: 50%; padding-right: 15px; border: none; vertical-align: top;">

<strong>Example 1 = Quick Visit</strong><br>

"Doc, do I have any health problems?"<br>

→ "Yes, you need to exercise more"

</td>

<td style="width: 50%; padding-left: 15px; border: none; vertical-align: top; border-left: 2px solid #fbbf24;">

<strong>Example 2 = Full Physical</strong><br>

"Check everything with scores!"<br>

→ Heart: 6/10 (CRITICAL)<br>

→ Lungs: 9/10 (OK)<br>

→ Vision: 8/10 (OK)

</td>

</tr>

</table>

</div>



<div style="background: linear-gradient(135deg, #dbeafe 0%, #bfdbfe 100%); border-left: 4px solid #2563eb; padding: 16px; border-radius: 8px; margin: 20px 0;">

<div style="font-weight: 700; color: #1e3a8a; margin-bottom: 10px; font-size: 15px;">🎯 Why Weighted Criteria Matter</div>

<div style="font-size: 14px; color: #1e3a8a; line-height: 1.7;">

For a <strong>banking app</strong>:<br>

✓ Security gets 60% weight → SQL injection = CRITICAL priority<br>

✓ Best practices get 25% → Use parameterized queries<br>

✓ Error handling gets 15% → Don't leak database structure<br><br>

<strong>Result:</strong> Security issues are fixed FIRST, not treated equally with minor style issues.

</div>

</div>

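To make the effect of the weights concrete, here is a small worked sketch that combines per-criterion scores into one total using the 60/25/15 split from the banking example. The scores are made-up placeholders, and `weighted_score` is a hypothetical helper, not something the notebook defines.

```python
# Weights from the banking-app example; the scores below are made-up
# placeholders for illustration, not real model output.
WEIGHTS = {"security": 0.60, "best_practices": 0.25, "error_handling": 0.15}

def weighted_score(scores, weights=WEIGHTS):
    """Combine per-criterion scores (0-10) into one weighted total (0-10)."""
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1.0")
    return sum(weights[name] * scores[name] for name in weights)

example_scores = {"security": 3, "best_practices": 8, "error_handling": 9}
print(round(weighted_score(example_scores), 2))  # 5.15
```

A plain average of these three scores would be about 6.67. The weighted total of 5.15 is lower because the weak security score counts for more than the other two criteria combined.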


---



**Let's see weighted security critique in action! 👇**

#### 💻 Run Example 2: Security-Focused Critique

**What this cell does:**  
Advanced version with **weighted security criteria** (60% security, 25% best practices, 15% error handling):
- **Step 1:** AI generates SQL input validator (likely vulnerable)
- **Step 2:** Security expert mode with severity ratings (CRITICAL/HIGH/MEDIUM/LOW)
- **Step 3:** Hardened code addressing all security flaws

**⏱️ Execution time:** ~30-40 seconds (makes 3 API calls)

**What to observe:**
- Initial code typically has SQL injection vulnerabilities
- Critique provides severity ratings and prioritizes fixes
- Improved code uses parameterized queries and proper validation

**💡 Learning focus:** Weighted criteria ensure critical security issues get fixed FIRST!

In [None]:
# Example 2: Security-Focused Self-Critique Loop

# This example shows a more sophisticated critique loop with weighted evaluation criteria.
# Perfect for security-critical code where detailed analysis is essential.

requirement = "Create a function that validates and sanitizes user input for a SQL query"

# STEP 1: Generate initial solution
# Goal: Get initial implementation without security guidance
# Expected: Basic solution that may have SQL injection vulnerabilities
print("=" * 70)
print("STEP 1: Generate Initial Solution")
print("=" * 70)
print("Requesting initial implementation...")
print("(No security guidance given yet - let's see what AI produces)")
print()

generate_messages = [
    {
        "role": "system",
        "content": "You are a Python developer. Generate code solutions."
    },
    {
        "role": "user",
        "content": f"""{requirement}

Provide your implementation in <code> tags."""
    }
]

initial_code = get_chat_completion(generate_messages)
print(initial_code)
print("\n")

# STEP 2: Detailed critique with weighted criteria
# Goal: Get comprehensive security-focused review
# Why separate role: Security reviewers think differently than code generators
# What to expect: Identification of SQL injection risks, input validation gaps
print("=" * 70)
print("STEP 2: Security-Focused Critique (AI as Security Expert)")
print("=" * 70)
print("Switching AI to 'security reviewer' mode...")
print("Will evaluate with weighted criteria:")
print("  • Security (60%, critical) - SQL injection, input validation")
print("  • Best practices (25%) - Proper escaping, parameterization")
print("  • Error handling (15%) - Graceful failures, no info leakage")
print()

critique_messages = [
    {
        "role": "system",
        "content": """You are a security-focused code reviewer. 

Evaluate code for:
- Security vulnerabilities (HIGHEST PRIORITY)
- Best practices
- Error handling
- Edge cases
- Code quality

Provide brutally honest feedback with specific issues and severity levels."""
    },
    {
        "role": "user",
        "content": f"""Requirement: {requirement}

Initial implementation:
{initial_code}

Critique this implementation using these weighted criteria:

**Security (60% weight) - CRITICAL:**
- SQL injection vulnerabilities
- Input validation gaps
- Sanitization effectiveness

**Best Practices (25% weight):**
- Use of parameterized queries
- Proper escaping methods
- Following secure coding standards

**Error Handling (15% weight):**
- Graceful failure on invalid input
- No information leakage in errors
- Clear error messages

For each criterion:
1. Identify specific issues
2. Rate severity: CRITICAL / HIGH / MEDIUM / LOW
3. Explain the risk
4. Suggest fix

Structure your response:
<critique>Your detailed security analysis</critique>
<issues>List of issues with severity ratings</issues>
<suggestions>Prioritized fixes (most critical first)</suggestions>"""
    }
]

critique = get_chat_completion(critique_messages)
print(critique)
print("\n")

# STEP 3: Implement fixes addressing critique
# Goal: Produce production-ready secure code
# Why this works: AI has full context of requirements + all identified issues
# What to expect: Parameterized queries, input validation, proper error handling
print("=" * 70)
print("STEP 3: Implement Secure Solution")
print("=" * 70)
print("AI will now create hardened code addressing all security issues...")
print()

improve_messages = [
    {
        "role": "system",
        "content": "You are a senior Python developer who learns from feedback and writes secure code."
    },
    {
        "role": "user",
        "content": f"""Requirement: {requirement}

Original implementation:
{initial_code}

Security critique received:
{critique}

Create an improved implementation that addresses ALL the issues raised in the critique.

Requirements for improved version:
• Fix ALL critical security vulnerabilities  
• Implement suggested best practices
• Add robust error handling
• Include inline comments explaining security measures

Provide the improved code in <improved_code> tags.
Explain key security changes in <changes> tags."""
    }
]

improved_code = get_chat_completion(improve_messages)
print(improved_code)

# Summary with actionable insights
print("\n" + "=" * 70)
print("💡 KEY INSIGHTS: Weighted Security Critique")
print("=" * 70)
print("""
This pattern demonstrates security-first development:

Step 1 - Generate: Initial code without security constraints
         → Reveals natural vulnerabilities in approach

Step 2 - Judge: Security expert mode with weighted criteria
         → Prioritizes issues (Security 60%, Best practices 25%, Errors 15%)
         → Provides severity ratings for triage
         
Step 3 - Harden: Fixes issues in priority order
         → Addresses critical security flaws first
         → Implements defense-in-depth measures

Why weighted criteria matter:
• Security gets 60% weight → SQL injection ranked CRITICAL
• Best practices get 25% → Parameterized queries recommended  
• Error handling gets 15% → Info leakage prevented

Real-world application:
• Automated security reviews in CI/CD
• Pre-production code hardening
• Security training (show before/after)
• Compliance checking (e.g., OWASP Top 10)

Pro tip: Run this pattern before code review to catch
obvious security issues, letting humans focus on logic
and architecture decisions.
""")

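Severity ratings are most useful in automation when code can act on them, for example by blocking a merge whenever a CRITICAL issue is reported. The sketch below is a deliberately crude keyword scan under that assumption. `count_severities` and `highest_severity` are hypothetical helpers, and the sample critique is invented. Because the critique prompt itself contains the word CRITICAL in its headings, it is safer to run the scan on the `<issues>` section only rather than on the whole reply.

```python
import re

SEVERITY_ORDER = ["CRITICAL", "HIGH", "MEDIUM", "LOW"]

def count_severities(critique_text):
    """Count how often each severity label appears as a whole uppercase word."""
    return {
        level: len(re.findall(rf"\b{level}\b", critique_text))
        for level in SEVERITY_ORDER
    }

def highest_severity(critique_text):
    """Return the most severe label mentioned, or None if there is none."""
    counts = count_severities(critique_text)
    for level in SEVERITY_ORDER:
        if counts[level]:
            return level
    return None

# Invented critique text standing in for real model output
sample_critique = """<issues>
1. String concatenation into the query - CRITICAL
2. No length limit on input - MEDIUM
3. Error message echoes raw input - HIGH
</issues>"""

print(count_severities(sample_critique))
print(highest_severity(sample_critique))  # CRITICAL
```

A pipeline could then skip Step 3 when `highest_severity` returns `None`, or require a human reviewer when it returns `"CRITICAL"`.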
#### Key Takeaways



**Pattern Selection Guide (Choose Your Approach)**



| Your Need | Pattern | How It Works | Use Case |
|-----------|---------|--------------|----------|
| **Validate quality** | 📋 Single Evaluation | Judge ONE output against criteria | PR quality gates, doc reviews |
| **Choose best option** | ⚖️ Comparative | Judge TWO options side-by-side | Architecture decisions, A/B testing |
| **Auto-improve** | 🔄 Self-Improvement | Generate → Critique → Fix loop | Code generation, security audits |



**The 4 Rules for Effective Judging**



| Rule | ❌ Don't | ✅ Do |
|------|----------|--------|
| **1. Be Specific** | "Evaluate code quality" | "Score: Security 40%, Performance 30%, Readability 30%" |
| **2. Weight Priorities** | Treat all criteria equally | **Critical aspects get 50-60% weight** |
| **3. Request Severity** | Get yes/no answers | Ask for: `CRITICAL / HIGH / MEDIUM / LOW` |
| **4. Chain Reviews** | One big review | Chain 1: Security → Chain 2: Performance → Chain 3: Synthesize |

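Rule 4 can be sketched as three calls, where the last one sees the output of the first two. This is an illustration under assumptions, not the notebook's own implementation: `chained_review` is a hypothetical function, and `ask` stands for any callable that takes a prompt string and returns the model's reply. A fake `ask` is used here so the sketch runs without an API key.

```python
def chained_review(code, ask):
    """Run two focused reviews, then a synthesis step that sees both."""
    security = ask(f"Review ONLY the security of this code:\n{code}")
    performance = ask(f"Review ONLY the performance of this code:\n{code}")
    summary = ask(
        "Combine these reviews into one prioritized list, "
        "most severe first.\n\n"
        f"Security review:\n{security}\n\n"
        f"Performance review:\n{performance}"
    )
    return {"security": security, "performance": performance, "summary": summary}

# Stand-in for a real model call so the sketch runs offline
def fake_ask(prompt):
    return f"[model reply to a {len(prompt)}-character prompt]"

result = chained_review("def f(x): return x", fake_ask)
print(result["summary"])
```

In this notebook, `ask` could be `lambda p: get_chat_completion([{"role": "user", "content": p}])`.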


**Copy-Paste Code Template**



<div style="background: #1e293b; color: white; padding: 20px; border-radius: 10px; margin: 20px 0; font-family: 'Courier New', monospace; font-size: 13px;">



**Pattern 3: Self-Improvement Loop (Most Common)**



```python
# Thin wrapper so the template reads cleanly; get_chat_completion comes from setup_utils
def ai(prompt):
    return get_chat_completion([{"role": "user", "content": prompt}])

# Step 1: Generate initial solution
draft = ai("Create [SOLUTION] for [PROBLEM]")

# Step 2: AI judges its own work
critique = ai(f"""
Review YOUR code: {draft}

Evaluate (weighted):
- Security (40%): [specific criteria]
- Performance (30%): [specific criteria]
- Readability (30%): [specific criteria]

For each: Score 0-10, list issues, rate severity (CRITICAL/HIGH/MEDIUM/LOW)
""")

# Step 3: AI improves based on critique
final = ai(f"""
Original code: {draft}
Issues found: {critique}

Fix all identified problems. Provide improved code.
""")
```



**When to use:** Production code, automated QA, no human reviewer  

**Cost:** 3 API calls per run, so more latency and tokens than a single prompt, in exchange for a reviewed and revised result



</div>

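The template runs the critique and fix steps once. If you want the optional verify step, one way is to repeat critique and fix until the judge reports nothing, with a cap on rounds so the loop always ends. The sketch below assumes the judge follows the instruction to reply "No issues found", which is a crude stop signal. `self_improve` is a hypothetical function, and `ask` is any callable that takes a prompt string and returns a reply, scripted here so the example runs offline.

```python
def self_improve(task, ask, max_rounds=3):
    """Generate, then critique and fix until the judge reports no issues.

    Returns (final_draft, number_of_fixes_applied). `max_rounds` caps the
    number of critique/fix cycles so the loop always terminates.
    """
    draft = ask(f"{task}\n\nProvide code in <code> tags.")
    for round_number in range(1, max_rounds + 1):
        critique = ask(
            f"Review this code for issues:\n{draft}\n\n"
            'If none are found, reply exactly "No issues found".'
        )
        if "no issues found" in critique.lower():
            return draft, round_number - 1
        draft = ask(
            f"Original code:\n{draft}\n\nIssues found:\n{critique}\n\n"
            "Fix the identified problems and return the improved code."
        )
    return draft, max_rounds

# Scripted replies stand in for the model: draft, critique, fix, critique
replies = iter(["v1", "Issue: no None check", "v2", "No issues found"])
final, fixes = self_improve("Validate emails", lambda prompt: next(replies))
print(final, fixes)  # v2 1
```

Each extra round adds two API calls, so a small `max_rounds` keeps cost predictable.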




<div style="background: linear-gradient(135deg, #ecfdf5 0%, #d1fae5 100%); padding: 20px; border-radius: 10px; margin: 25px 0; border-left: 5px solid #10b981;">

<div style="font-size: 18px; font-weight: 700; color: #065f46; margin-bottom: 10px;">

🎯 Core Principle (Remember This!)

</div>

<div style="font-size: 16px; color: #047857; line-height: 1.7; margin-bottom: 10px;">

<strong>Separate creation from critique.</strong><br>

AI finds more issues when it's not simultaneously trying to create.

</div>

<div style="font-size: 14px; color: #065f46; font-style: italic;">

Why it works: Just as writers improve a draft by editing it in a separate pass, AI tends to produce higher-quality output when it first critiques a draft and then revises it.

</div>

</div>





**Real-world wins:** 

- Auto-review PRs before human review

- Compare microservices vs monolith

- Security audits with weighted scoring

- Test generation with self-validation



*Reference: [OpenAI Evals](https://github.com/openai/evals) | [Anthropic Constitutional AI](https://www.anthropic.com/index/constitutional-ai-harmlessness-from-ai-feedback)*

---



### 🎯 Try It Yourself: LLM-as-Judge



**Common Misconception:** AI comparisons are just subjective opinions without clear reasoning.



**The Reality:** Weighted evaluation rubrics make assessments more consistent, easier to explain, and easier to act on.



**Your Task:** You have two implementations below. The bad prompt just asks "which is better?" Fix it by:

1. Creating specific evaluation criteria

2. Adding weights to each criterion (e.g., Security 40%, Performance 30%, Readability 30%)

3. Requesting scores and structured comparison



See how rubrics transform vague opinions into actionable insights!

#### 💻 Run This Cell to Test Evaluation Rubrics

**What this cell does:**  
Compares two password hashing functions (MD5 vs bcrypt) with and without evaluation rubrics.

**⏱️ Execution time:** ~10 seconds for BAD, ~15 seconds for GOOD (when uncommented)

**Your task:**
1. Run as-is to see vague "which is better?" comparison
2. Uncomment the GOOD section with weighted rubrics
3. Run again to see structured evaluation with scores
4. Notice how rubrics produce actionable insights!

In [None]:
# Two implementations to compare
impl_a = "def hash_pwd(p): return hashlib.md5(p.encode()).hexdigest()"
impl_b = "def hash_pwd(p): return bcrypt.hashpw(p.encode(), bcrypt.gensalt())"

# ❌ BAD: Vague comparison request
bad_messages = [{"role": "user", "content": f"Which is better?\nA: {impl_a}\nB: {impl_b}"}]
bad_response = get_chat_completion(bad_messages)
print("=" * 70)
print("WITHOUT RUBRIC (Vague opinion):")
print("=" * 70)
print(bad_response)
print("\n")

# ✅ YOUR TURN: Create evaluation rubric
# TODO: Uncomment and complete
# good_messages = [{
#     "role": "system",
#     "content": """You are a code quality judge. Evaluate based on:
# - Security (40%): Resistance to attacks, proper crypto
# - Performance (30%): Speed, resource usage
# - Readability (30%): Clear, maintainable
# 
# Provide scores 0-10 for each, calculate weighted total, recommend best option."""
#     },
#     {
#         "role": "user",
#         "content": f"""Compare these password hashing implementations:
# 
# <implementation_a>
# {impl_a}
# </implementation_a>
# 
# <implementation_b>
# {impl_b}
# </implementation_b>
# 
# Provide:
# - Scores for each criterion
# - Weighted total scores
# - Recommendation with justification"""
#     }
# ]
# good_response = get_chat_completion(good_messages)
# print("=" * 70)
# print("WITH RUBRIC (Objective assessment):")
# print("=" * 70)
# print(good_response)
# print("\n💡 Rubric provides clear, actionable comparison with reasoning!")

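Once you have written a rubric by hand, you may want to generate the judge prompt from data so the weights are checked instead of typed freehand. Here is a minimal sketch under that assumption. `build_rubric_prompt` is a hypothetical helper, and the criteria mirror the ones in the exercise; the returned string could serve as the `system` message content.

```python
def build_rubric_prompt(criteria):
    """Build a judge system prompt from {name: (weight_percent, description)}."""
    total = sum(weight for weight, _ in criteria.values())
    if total != 100:
        raise ValueError(f"weights must sum to 100, got {total}")
    lines = [
        f"- {name} ({weight}%): {description}"
        for name, (weight, description) in criteria.items()
    ]
    return (
        "You are a code quality judge. Evaluate based on:\n"
        + "\n".join(lines)
        + "\n\nProvide scores 0-10 for each, calculate weighted total, "
        "recommend best option."
    )

# Same criteria and weights as the exercise rubric
rubric = {
    "Security": (40, "Resistance to attacks, proper crypto"),
    "Performance": (30, "Speed, resource usage"),
    "Readability": (30, "Clear, maintainable"),
}
print(build_rubric_prompt(rubric))
```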
---

<div style="margin:20px 0; padding:16px 24px; background:linear-gradient(135deg, #ffecd2 0%, #fcb69f 100%); border-radius:10px; color:#8b4513; text-align:center; box-shadow:0 4px 15px rgba(252,182,159,0.3);">
  <strong style="font-size:1.05em;">🎉 All 8 tactics learned! Practice makes perfect.</strong><br>
  <span style="font-size:0.92em; opacity:0.95; margin-top:4px; display:block;">You've absorbed a lot—take a moment before diving into hands-on activities.</span>
</div>

---

<div style="margin:24px 0; padding:20px 24px; background:linear-gradient(135deg, #f8fafc 0%, #e2e8f0 100%); border-radius:12px; border-left:5px solid #10b981; box-shadow:0 2px 8px rgba(0,0,0,0.1);">
  <div style="color:#1e293b; font-size:0.85em; font-weight:600; text-transform:uppercase; letter-spacing:1px; margin-bottom:8px;">⏭️ Next Section</div>
  <div style="color:#0f172a; font-size:1.15em; font-weight:700; margin-bottom:6px;">Section 2.5: Hands-On Practice</div>
  <div style="color:#475569; font-size:0.95em; line-height:1.5; margin-bottom:12px;">Apply all 8 tactics independently in unguided practice activities with automated evaluation.</div>
  <a href="./2.5-hands-on-practice.ipynb" style="display:inline-block; padding:8px 16px; background:#10b981; color:#fff; text-decoration:none; border-radius:6px; font-weight:600; font-size:0.9em; transition:all 0.2s;">Continue to Section 2.5 →</a>
</div>