# Module 3 - Apply Advanced Prompt Engineering Tactics to the SDLC

| **Aspect** | **Details** |
|-------------|-------------|
| **Goal** | Blend previously mastered strategies—task decomposition, role prompting, chain-of-thought reasoning, LLM-as-Judge critique, and structured formatting—to design reliable prompts for code review and software development lifecycle (SDLC) activities |
| **Time** | ~120-150 minutes (2-2.5 hours) |
| **Prerequisites** | Module 2 completion, Python 3.8+, IDE with notebook support, API access (GitHub Copilot, CircuIT, or OpenAI) |
| **Setup Required** | Clone the repository and follow [Quick Setup](../../README.md#-quick-setup) before running this notebook |

---

## 🚀 Ready to Start?

<div style="margin-top:16px; color:#991b1b; padding:12px; background:#fee2e2; border-radius:6px; border-left:4px solid #ef4444;">
<strong>⚠️ Important:</strong> <br><br>
This module builds directly on Module 2 techniques. Make sure you've completed Module 2 before starting.<br>
</div>


## 🔧 Setup: Environment Configuration

### Step 1: Install Required Dependencies

Let's start by installing the packages we need for this tutorial.

Run the cell below. You should see a success message when installation completes:


In [None]:
# Install required packages for Module 3
import subprocess
import sys

def install_requirements():
    try:
        # Install from requirements.txt
        subprocess.check_call([sys.executable, "-m", "pip", "install", "-q", "-r", "requirements.txt"])
        print("✅ SUCCESS! Module 3 dependencies installed successfully.")
        print("📦 Ready for: openai, anthropic, python-dotenv, requests")
    except subprocess.CalledProcessError as e:
        print(f"❌ Installation failed: {e}")
        print("💡 Try running: pip install openai anthropic python-dotenv requests")

install_requirements()


### Step 2: Connect to AI Model

<div style="margin-top:16px; color:#78350f; padding:12px; background:#fef3c7; border-radius:6px; border-left:4px solid #f59e0b;">
<strong>💡 Note:</strong> <br><br>
The code below runs on your local machine and connects to AI services over the internet.
</div>

Choose your preferred option:

- **Option A: GitHub Copilot API (local proxy)** ⭐ **Recommended**: 
  - Supports both **Claude** and **OpenAI** models
  - No API keys needed - uses your GitHub Copilot subscription
  - Follow [GitHub-Copilot-2-API/README.md](../../GitHub-Copilot-2-API/README.md) to authenticate and start the local server
  - Run the setup cell below and **edit your preferred provider** (`"openai"` or `"claude"`) by setting the `PROVIDER` variable
  - Available models:
    - **OpenAI**: gpt-4o, gpt-4, gpt-3.5-turbo, o3-mini, o4-mini
    - **Claude**: claude-3.5-sonnet, claude-3.7-sonnet, claude-sonnet-4

- **Option B: OpenAI API**: If you have OpenAI API access, uncomment and run the **Option B** cell below.

- **Option C: CircuIT APIs (Azure OpenAI)**: If you have CircuIT API access, uncomment and run the **Option C** cell below.


In [None]:
# Option A: GitHub Copilot API setup (Recommended)
import openai
import anthropic
import os

# ============================================
# 🎯 CHOOSE YOUR AI MODEL PROVIDER
# ============================================
# Set your preference: "openai" or "claude"
PROVIDER = "claude"  # Change to "openai" to use OpenAI models

# ============================================
# 📋 Available Models by Provider
# ============================================
# OpenAI Models (via GitHub Copilot):
#   - gpt-4o (recommended, supports vision)
#   - gpt-4
#   - gpt-3.5-turbo
#   - o3-mini, o4-mini
#
# Claude Models (via GitHub Copilot):
#   - claude-3.5-sonnet (recommended, supports vision)
#   - claude-3.7-sonnet (supports vision)
#   - claude-sonnet-4 (supports vision)
# ============================================

# Configure clients for both providers
openai_client = openai.OpenAI(
    base_url="http://localhost:7711/v1",
    api_key="dummy-key"
)

claude_client = anthropic.Anthropic(
    api_key="dummy-key",
    base_url="http://localhost:7711"
)

# Set default models for each provider
OPENAI_DEFAULT_MODEL = "gpt-4o"  # Recommended model from the list above
CLAUDE_DEFAULT_MODEL = "claude-sonnet-4"


def _extract_text_from_blocks(blocks):
    """Extract text content from response blocks returned by the API."""
    parts = []
    for block in blocks:
        text_val = getattr(block, "text", None)
        if isinstance(text_val, str):
            parts.append(text_val)
        elif isinstance(block, dict):
            t = block.get("text")
            if isinstance(t, str):
                parts.append(t)
    return "\n".join(parts)


def get_openai_completion(messages, model=None, temperature=0.0):
    """Get completion from OpenAI models via GitHub Copilot."""
    if model is None:
        model = OPENAI_DEFAULT_MODEL
    try:
        response = openai_client.chat.completions.create(
            model=model,
            messages=messages,
            temperature=temperature
        )
        return response.choices[0].message.content
    except Exception as e:
        return f"❌ Error: {e}\n💡 Make sure GitHub Copilot proxy is running on port 7711"


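def get_claude_completion(messages, model=None, temperature=0.0):
    """Get completion from Claude models via GitHub Copilot."""
    if model is None:
        model = CLAUDE_DEFAULT_MODEL
    # The Anthropic Messages API expects the system prompt as a top-level
    # `system` parameter, so pull any "system" messages out of the list.
    system_prompt = "\n\n".join(m["content"] for m in messages if m["role"] == "system")
    chat_messages = [m for m in messages if m["role"] != "system"]
    extra = {"system": system_prompt} if system_prompt else {}
    try:
        response = claude_client.messages.create(
            model=model,
            max_tokens=8192,
            messages=chat_messages,
            temperature=temperature,
            **extra
        )
        return _extract_text_from_blocks(getattr(response, "content", []))
    except Exception as e:
        return f"❌ Error: {e}\n💡 Make sure GitHub Copilot proxy is running on port 7711"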


def get_chat_completion(messages, model=None, temperature=0.0):
    """
    Generic function to get chat completion from any provider.
    Routes to the appropriate provider-specific function based on PROVIDER setting.
    """
    if PROVIDER.lower() == "claude":
        return get_claude_completion(messages, model, temperature)
    else:  # Default to OpenAI
        return get_openai_completion(messages, model, temperature)


def get_default_model():
    """Get the default model for the current provider."""
    if PROVIDER.lower() == "claude":
        return CLAUDE_DEFAULT_MODEL
    else:
        return OPENAI_DEFAULT_MODEL


# ============================================
# 🧪 TEST CONNECTION
# ============================================
print("🔄 Testing connection to GitHub Copilot proxy...")
test_result = get_chat_completion([
    {"role": "user", "content": "Say 'Connection successful!' if you can read this."}
])

if test_result and "success" in test_result.lower():
    print(f"✅ Connection successful! Using {PROVIDER.upper()} provider with model: {get_default_model()}")
    print(f"📝 Response: {test_result}")
else:
    print("⚠️ Connection test completed but response unexpected:")
    print(f"📝 Response: {test_result}")


## 🎯 Applying Prompt Engineering to SDLC Tasks

---

### Introduction: From Tactics to Real-World Applications

#### 🚀 Ready to Transform Your Development Workflow?

You've successfully mastered the core tactics in Module 2. Now comes the exciting part - **applying these techniques to real-world software engineering challenges** that you face every day.

Think of what you've accomplished so far as **learning individual martial arts moves**. Now we're going to **choreograph them into powerful combinations** that solve actual development problems.


#### 👨‍💻 What You're About to Master

In the next sections, you'll discover **how to combine tactics strategically** to build production-ready prompts for critical SDLC tasks:

<div style="display: grid; grid-template-columns: repeat(2, 1fr); gap: 16px; margin: 20px 0;">

<div style="background: #f8fafc; border: 2px solid #e2e8f0; border-radius: 8px; padding: 16px; text-align: center; color: #000000;">
<strong>🔍 Code Review Automation</strong><br>
<em>Comprehensive review prompts with structured feedback</em>
</div>

<div style="background: #f8fafc; border: 2px solid #e2e8f0; border-radius: 8px; padding: 16px; text-align: center; color: #000000;">
<strong>🧪 Test Generation & QA</strong><br>
<em>Smart test plans with coverage gap analysis</em>
</div>

<div style="background: #f8fafc; border: 2px solid #e2e8f0; border-radius: 8px; padding: 16px; text-align: center; color: #000000;">
<strong>⚖️ Quality Validation</strong><br>
<em>LLM-as-Judge rubrics for output verification</em>
</div>

<div style="background: #f8fafc; border: 2px solid #e2e8f0; border-radius: 8px; padding: 16px; text-align: center; color: #000000;">
<strong>📋 Reusable Templates</strong><br>
<em>Parameterized prompts for CI/CD integration</em>
</div>

</div>

<div style="margin-top:16px; color:#15803d; padding:12px; background:#dcfce7; border-radius:6px; border-left:4px solid #22c55e;">
<strong>💡 Pro Tip:</strong> <br><br>
This module covers practical applications over 120-150 minutes. <strong>Take short breaks</strong> between sections to reflect on how each template applies to your projects. <strong>Make notes</strong> as you progress—jot down specific use cases from your codebase. The key skill is learning <strong>which tactic combinations solve which problems</strong>!
</div>

---


### 📍 How to Use Break Points

<div style="background:#f0f9ff; border-left:4px solid #3b82f6; padding:16px; border-radius:6px; margin:20px 0; color:#000000;">
<strong style="color:#1e40af;">💡 Taking Breaks? We've Got You Covered!</strong><br><br>

This module is designed for 120-150 minutes of focused learning. To help you manage your time effectively, we've added **4 strategic break points** throughout:

<table style="width:100%; margin:10px 0; border-collapse: collapse;">
  <tr style="background:#dbeafe;">
    <th style="padding:8px; text-align:left; border:1px solid #93c5fd;">Break Point</th>
    <th style="padding:8px; text-align:left; border:1px solid #93c5fd;">Location</th>
    <th style="padding:8px; text-align:left; border:1px solid #93c5fd;">Time Elapsed</th>
    <th style="padding:8px; text-align:left; border:1px solid #93c5fd;">Bookmark Text</th>
  </tr>
  <tr>
    <td style="padding:8px; border:1px solid #93c5fd;">☕ Break #1</td>
    <td style="padding:8px; border:1px solid #93c5fd;">After Section 1</td>
    <td style="padding:8px; border:1px solid #93c5fd;">~40 min</td>
    <td style="padding:8px; border:1px solid #93c5fd;">"Section 2: Test Case Generation Template"</td>
  </tr>
  <tr style="background:#eff6ff;">
    <td style="padding:8px; border:1px solid #93c5fd;">🍵 Break #2</td>
    <td style="padding:8px; border:1px solid #93c5fd;">After Section 2</td>
    <td style="padding:8px; border:1px solid #93c5fd;">~75 min</td>
    <td style="padding:8px; border:1px solid #93c5fd;">"Section 3: LLM-as-Judge Evaluation Rubric"</td>
  </tr>
  <tr>
    <td style="padding:8px; border:1px solid #93c5fd;">🧃 Break #3</td>
    <td style="padding:8px; border:1px solid #93c5fd;">After Section 3</td>
    <td style="padding:8px; border:1px solid #93c5fd;">~105 min</td>
    <td style="padding:8px; border:1px solid #93c5fd;">"Hands-On Practice Activities"</td>
  </tr>
  <tr style="background:#eff6ff;">
    <td style="padding:8px; border:1px solid #93c5fd;">🎯 Break #4</td>
    <td style="padding:8px; border:1px solid #93c5fd;">After Practice Activities</td>
    <td style="padding:8px; border:1px solid #93c5fd;">~145 min</td>
    <td style="padding:8px; border:1px solid #93c5fd;">"Section 4: Template Best Practices"</td>
  </tr>
</table>

**How to Resume Your Session:**
1. Scroll down to find the colorful break point card you last saw
2. Look for the **"📌 BOOKMARK TO RESUME"** section
3. Use `Ctrl+F` (or `Cmd+F` on Mac) to search for the bookmark text
4. You'll jump right to where you left off!

**Pro Tip:** Each break point card shows:
- ✅ What you've completed
- ⏭️ What's coming next
- ⏱️ Estimated time for the next section

Feel free to work at your own pace—these are suggestions, not requirements! 🚀
</div>

---


### 🎨 Technique Spotlight: Strategic Combinations

Here's how Module 2 tactics combine to solve real SDLC challenges:

| **Technique** | **Purpose in SDLC Context** | **Prompting Tip** |
|---------------|----------------------------|-------------------|
| **Task Decomposition** | Break multifaceted engineering tasks (e.g., review + test suggestions) into manageable parts | Structure prompt into numbered steps or XML blocks (e.g., `<review>`, `<tests>`) |
| **Role Prompting** | Align the model's persona with engineering expectations (e.g., "Senior Backend Engineer") | Specify domain, experience level, and evaluation criteria |
| **Chain-of-Thought** | Ensure reasoning is visible, aiding traceability and auditing | Request structured reasoning before conclusions, optionally hidden using "inner monologue" tags |
| **LLM-as-Judge** | Evaluate code changes or generated artifacts against standards | Provide rubric with weighted criteria and evidence requirement |
| **Few-Shot Examples** | Instill preferred review tone, severity labels, or test formats | Include short exemplars with both input (`<diff>`, `<tests>`) and expected reasoning |
| **Prompt Templates** | Reduce prompt drift across teams and tools | Parameterize sections (`{{code_diff}}`, `{{requirements}}`) for consistent reuse |
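
The "weighted criteria" tip for LLM-as-Judge can be made concrete with a little arithmetic. The sketch below is illustrative only: the criterion names, the weights, and the 1-5 scale are assumptions, and in practice the per-criterion scores would come from the judge model's output rather than being hard-coded.

```python
def weighted_score(scores: dict, weights: dict) -> float:
    """Combine per-criterion scores into one weighted average."""
    if set(scores) != set(weights):
        raise ValueError("scores and weights must cover the same criteria")
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(scores[name] * weights[name] for name in weights) / total_weight


# Hypothetical rubric: weights express how much each criterion matters.
weights = {"security": 0.4, "correctness": 0.3, "performance": 0.2, "readability": 0.1}

# Hypothetical judge output on a 1-5 scale.
scores = {"security": 2, "correctness": 4, "performance": 5, "readability": 3}

print(round(weighted_score(scores, weights), 2))
```

Because security carries the largest weight, a low security score pulls the overall result down even when the other criteria score well.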

#### 🔗 The Power of Strategic Combinations

The real skill isn't using tactics in isolation—it's knowing **which combinations solve which problems**. Each section demonstrates a different combination pattern optimized for specific SDLC challenges.

Ready to build production-ready solutions? Let's dive in! 👇


## 🔍 Section 1: Code Review Automation Template

### Building a Comprehensive Code Review Prompt with Multi-Tactic Combination

<div style="background:#fef3c7; border-left:4px solid #f59e0b; padding:16px; border-radius:6px; margin:20px 0; color:#000000;">
<strong style="color:#92400e;">🎯 What You'll Build in This Section</strong><br><br>

You'll create a **production-ready code review prompt template** that automatically analyzes code changes with the rigor of a senior engineer. This isn't just about finding bugs: you're building a system that provides consistent, traceable, and actionable feedback.

**Time Required:** ~40 minutes (includes building, testing, and refining the template)
</div>

#### 📋 Before You Start: What You'll Need

To get the most from this section, have ready:

1. **A code diff to review** (options):
   - A recent pull request from your repository
   - Sample code provided in the activities below
   - Any Python, JavaScript, or Java code change you want analyzed

2. **Clear review criteria** for your domain:
   - What counts as a "blocker" vs "minor" issue in your team?
   - Which security patterns should be enforced?
   - What performance thresholds matter for your application?

3. **Your API connection** set up and tested (from the setup section above)

<div style="background:#dbeafe; border-left:4px solid #3b82f6; padding:16px; border-radius:6px; margin:20px 0; color:#000000;">
<strong style="color:#1e40af;">💡 Why This Approach Works with Modern LLMs</strong><br><br>

This template follows industry best practices for prompt engineering with advanced language models. According to [Claude 4 prompt engineering best practices](https://docs.claude.com/en/docs/build-with-claude/prompt-engineering/claude-4-best-practices), modern LLMs excel when you:

- **Be explicit about expectations** - We'll define exactly what constitutes each severity level
- **Provide context for behavior** - Explain *why* certain patterns are problematic (e.g., "SQL injection vulnerabilities allow attackers to access sensitive data")
- **Use structured formats** - XML tags help models maintain focus across complex multi-step analyses
- **Encourage visible reasoning** - Chain-of-thought reveals the "why" behind each finding, making reviews auditable

These aren't arbitrary choices—they directly address how advanced language models process instructions most effectively, ensuring consistent results across different AI providers.
</div>

#### 🎯 The Problem We're Solving

Manual code reviews face three critical challenges:

1. **⏰ Time Bottlenecks** 
   - Senior engineers spend 8-12 hours/week reviewing PRs
   - Review queues delay feature delivery by 2-3 days on average
   - **Impact:** Slower velocity, frustrated developers

2. **🎯 Inconsistent Standards**
   - Different reviewers prioritize different concerns
   - New team members lack institutional knowledge
   - Review quality varies based on reviewer fatigue
   - **Impact:** Technical debt accumulates, security gaps emerge

3. **📝 Lost Knowledge**
   - Review reasoning buried in PR comments
   - No searchable audit trail for security decisions
   - Hard to train junior developers on review standards
   - **Impact:** Repeated mistakes, difficult compliance auditing

#### ✨ Understanding Prompt Templates

According to [prompt templating best practices](https://docs.claude.com/en/docs/build-with-claude/prompt-engineering/prompt-templates-and-variables), effective prompts separate **fixed content** (static instructions) from **variable content** (dynamic inputs). This separation enables:

**Key Benefits:**
- **Consistency** - Same review standards applied every time
- **Efficiency** - Swap inputs without rewriting instructions
- **Testability** - Quickly test different code diffs
- **Scalability** - Manage complexity as your application grows
- **Version Control** - Track changes to prompt logic separately from data

**How to Templatize:**
1. **Identify fixed content** - Instructions that never change (e.g., "Act as a Senior Backend Engineer")
2. **Identify variable content** - Dynamic data that changes per request (e.g., code diffs, repository names)
3. **Use placeholders** - Mark variables with `{{double_brackets}}` for easy identification
4. **Separate concerns** - Keep prompt logic in templates, data in variables

**Example:**
```
Fixed: "Review this code for security issues"
Variable: {{code_diff}} ← Changes with each API call
Template: "Review this code for security issues: {{code_diff}}"
```
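
One way to implement the templatizing steps above is a small substitution helper. This is a minimal sketch, not a prescribed implementation: `render_template` is a hypothetical name, and it assumes placeholders contain only letters, digits, and underscores.

```python
import re

# Matches {{name}} placeholders, allowing optional spaces inside the braces.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render_template(template: str, variables: dict) -> str:
    """Replace every {{name}} placeholder with the matching variable."""
    missing = sorted({name for name in PLACEHOLDER.findall(template)
                      if name not in variables})
    if missing:
        # Fail loudly instead of sending a half-filled prompt to the model.
        raise KeyError(f"Missing template variables: {', '.join(missing)}")
    # A function replacement avoids backslash handling in the inserted text.
    return PLACEHOLDER.sub(lambda m: str(variables[m.group(1)]), template)


template = "Review this code for security issues: {{code_diff}}"
prompt = render_template(template, {"code_diff": "+ print('hello')"})
print(prompt)
```

Raising on missing variables is a deliberate choice here: a prompt that still contains a literal `{{code_diff}}` tends to produce confusing model output that is harder to debug than an immediate error.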

#### 🏗️ How We'll Build It: The Tactical Combination

This template strategically combines five Module 2 tactics:

| **Tactic** | **Purpose in This Template** | **Why Modern LLMs Need This** |
|------------|------------------------------|-------------------------------|
| **Role Prompting** | Establishes "Senior Backend Engineer" perspective with specific expertise | LLMs respond better when given explicit expertise context rather than assuming generic knowledge |
| **Structured Inputs (XML)** | Separates code, context, and guidelines into clear sections | Prevents models from mixing different information types during analysis |
| **Task Decomposition** | Breaks review into 4 sequential steps (Think → Assess → Suggest → Verdict) | Advanced LLMs excel at following explicit numbered steps rather than implicit workflows |
| **Chain-of-Thought** | Makes reasoning visible in Analysis section | Improves accuracy by forcing deliberate analysis before conclusions |
| **Structured Output** | Uses readable markdown format with severity levels | Enables human readability while maintaining parseable structure for automation |
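
The "Structured Inputs (XML)" row can be sketched as a helper that wraps each piece of input in its own tag. The function names and the sample section text are illustrative assumptions; the tag names mirror the ones used in this module's template.

```python
def xml_section(tag: str, body: str) -> str:
    """Wrap one piece of prompt content in a named XML-style tag."""
    return f"<{tag}>\n{body.strip()}\n</{tag}>"


def build_review_prompt(sections: dict) -> str:
    """Join tagged sections in insertion order, separated by blank lines."""
    return "\n\n".join(xml_section(tag, body) for tag, body in sections.items())


# Hypothetical inputs: each kind of information gets its own tagged block.
prompt = build_review_prompt({
    "context": "Repository: user-service-api\nPurpose: Add email-based user lookup",
    "code_diff": "+ def get_user_by_email(email): ...",
    "review_guidelines": "Check for SQL injection and missing input validation.",
})
print(prompt)
```

Keeping code, context, and guidelines in separate tags makes it easier to swap one section without touching the others.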

<div style="background:#dcfce7; border-left:4px solid #22c55e; padding:16px; border-radius:6px; margin:20px 0; color:#000000;">
<strong style="color:#166534;">🚀 Let's Build It!</strong><br><br>

In the next cell, you'll see the complete template structure. **Pay special attention to**:
- How we use explicit language to define severity levels (not "bad code" but "allows SQL injection")
- Why the markdown output format is more readable than XML while still being parseable
- How parameters like `{{tech_stack}}` and `{{change_purpose}}` make the template reusable across projects
- How the 6 review dimensions (Security, Performance, Error Handling, etc.) ensure comprehensive analysis

After reviewing the template, you'll test it on real code and see how each tactic contributes to the result.
</div>


### 📋 Template Structure

```xml
<role>
Act as a Senior Backend Engineer specializing in {{tech_stack}}.
</role>

<context>
Repository: {{repo_name}}
Service: {{service_name}}
Purpose: {{change_purpose}}
</context>

<code_diff>
{{code_diff}}
</code_diff>

<review_guidelines>
Evaluate the code across these critical dimensions:

1. **Security**: Check for vulnerabilities (SQL injection, XSS, insecure dependencies, exposed secrets)
2. **Performance**: Identify bottlenecks (N+1 queries, memory leaks, inefficient algorithms)
3. **Error Handling**: Validate proper exception handling and edge case coverage
4. **Code Quality**: Assess readability, simplicity, and adherence to standards
5. **Correctness**: Verify logic achieves intended functionality
6. **Maintainability**: Check for unnecessary complexity or dependencies

For each finding:
- Cite exact lines using git diff markers
- Explain why it's problematic (impact on users, security, or system)
- If code is acceptable, confirm with specific justification
</review_guidelines>

<tasks>
Step 1 - Think: Analyze the code systematically using chain-of-thought reasoning in the Analysis section.
         Consider:
         • What could go wrong with this code?
         • Are there security implications?
         • How does this perform at scale?
         • Are edge cases handled?

Step 2 - Assess: For each issue identified, provide:
  • Severity: 
    - BLOCKER: Security vulnerabilities, data loss risks, critical bugs
    - MAJOR: Performance issues, poor error handling, significant technical debt
    - MINOR: Code style inconsistencies, missing comments, small optimizations
    - NIT: Formatting, naming conventions, trivial improvements
  • Description: What is the issue and why it matters
  • Evidence: Specific line numbers and code excerpts
  • Impact: Potential consequences (security risk, performance degradation, etc.)

Step 3 - Suggest: Provide actionable remediation:
  • Specific code improvements or refactoring
  • Alternative approaches to consider
  • Questions for the author about design decisions

Step 4 - Verdict: Conclude with clear decision:
  • Pass/Fail/Needs Discussion
  • Summary of key findings
  • Required actions before merge
</tasks>

<output_format>
Provide your review in clear markdown format:

## 🧠 Analysis
[Your reasoning about potential issues - what patterns concern you?]

## 🔍 Findings

### [SEVERITY] Issue Title
**Lines:** [specific line numbers]
**Problem:** [what's wrong and why it matters]
**Impact:** [consequences - security risk, performance, etc.]
**Fix:** [specific recommendation or code suggestion]

[Repeat for each issue found]

## ✅ Verdict
**Decision:** [PASS / FAIL / NEEDS_DISCUSSION]
**Summary:** [Brief overview of review]
**Required Actions:** [What must be done before merge]
</output_format>
```

#### 🎯 What Makes This Production-Ready?

✅ **Comprehensive Review Dimensions** - Covers Security, Performance, Error Handling, Code Quality, Correctness, and Maintainability (not just "find bugs")

✅ **Clear Severity Definitions** - Explicit criteria for BLOCKER/MAJOR/MINOR/NIT classifications prevent ambiguity

✅ **Impact Analysis** - Every finding explains *why* it matters (security risk, performance degradation, maintainability issues)

✅ **Actionable Guidance** - Prompts for specific code improvements, not vague suggestions

✅ **Decision Framework** - Pass/Fail/Needs Discussion verdict with required actions before merge

✅ **Readable Output Format** - Uses clean markdown instead of verbose XML for better human readability and easier integration with PR tools

These additions ensure reviews are consistent, auditable, and aligned with production quality standards.
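
To illustrate why the markdown output is still parseable for automation, here is a minimal sketch that extracts findings and the verdict with regular expressions. It assumes the model follows the `### [SEVERITY] Issue Title` and `**Decision:**` conventions exactly; `parse_review`, `should_block_merge`, and the sample review text are hypothetical.

```python
import re

# Finding headings such as "### [BLOCKER] SQL injection in user lookup".
FINDING = re.compile(r"^###\s*\[?(BLOCKER|MAJOR|MINOR|NIT)\]?\s*(.+)$", re.MULTILINE)
# The verdict line, for example "**Decision:** FAIL".
DECISION = re.compile(r"\*\*Decision:\*\*\s*\[?(PASS|FAIL|NEEDS_DISCUSSION)\]?")


def parse_review(markdown: str) -> dict:
    """Extract severity-tagged findings and the final decision."""
    findings = [{"severity": sev, "title": title.strip()}
                for sev, title in FINDING.findall(markdown)]
    match = DECISION.search(markdown)
    return {"findings": findings, "decision": match.group(1) if match else None}


def should_block_merge(review: dict) -> bool:
    """Fail closed: block on any BLOCKER or on a missing or non-PASS verdict."""
    has_blocker = any(f["severity"] == "BLOCKER" for f in review["findings"])
    return has_blocker or review["decision"] != "PASS"


sample_review = """## 🧠 Analysis
The query is built with string interpolation.

## 🔍 Findings

### [BLOCKER] SQL injection in user lookup
**Lines:** 2-3
**Problem:** User input is interpolated into the SQL string.

### [NIT] Missing docstring
**Lines:** 1

## ✅ Verdict
**Decision:** FAIL
"""

review = parse_review(sample_review)
print(review["decision"], should_block_merge(review))
```

Model output can drift from the requested format, so a real pipeline would likely treat an unparseable review as "needs human attention" rather than as a pass, which is what the fail-closed check does.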

---

### 💻 Working Example: Reviewing a Security Vulnerability

Let's apply our enhanced template to a real-world scenario - a code change that introduces a SQL injection vulnerability.


In [None]:
# Example: Security-Focused Code Review with Enhanced Template
code_diff = """
+ def get_user_by_email(email):
+     query = f"SELECT * FROM users WHERE email = '{email}'"
+     cursor.execute(query)
+     return cursor.fetchone()
"""

messages = [
    {
        "role": "system",
        "content": "You are a Senior Security Engineer specializing in application security and OWASP Top 10 vulnerabilities."
    },
    {
        "role": "user",
        "content": f"""
<context>
Repository: user-service-api
Service: Authentication Service
Purpose: Add email-based user lookup for login feature
Security Context: This service handles sensitive user authentication data and is exposed to external API requests
</context>

<code_diff>
{code_diff}
</code_diff>

<review_guidelines>
Evaluate the code with emphasis on security vulnerabilities, following [AWS security scanning best practices](https://github.com/aws-samples/anthropic-on-aws/blob/main/advanced-claude-code-patterns/commands/security-scan.md):

**Primary Focus - Security:**
- OWASP Top 10 vulnerabilities (Injection, Authentication, XSS, etc.)
- Input validation and sanitization
- Authentication and authorization flaws
- Sensitive data exposure
- Known CVE/CWE patterns

**Secondary Considerations:**
- Performance implications of security fixes
- Error handling (avoid information leakage)
- Code quality and maintainability
- Correctness of implementation

For each security finding:
- Identify the vulnerability type and CWE/CVE reference if applicable
- Cite exact lines using git diff markers
- Explain the attack vector and potential impact
- Provide secure coding remediation with examples
</review_guidelines>

<tasks>
Step 1 - Security Analysis: Systematically analyze for vulnerabilities in the Analysis section.
         Consider:
         • What attack vectors exist in this code?
         • Which OWASP Top 10 categories apply?
         • What is the blast radius if exploited?
         • Are there any CWE patterns present?

Step 2 - Vulnerability Assessment: For each security issue, provide:
  • Severity (Security-focused): 
    - CRITICAL: Remote code execution, authentication bypass, SQL injection allowing data exfiltration
    - HIGH: Privilege escalation, XSS, insecure deserialization, significant data exposure
    - MEDIUM: Information disclosure, missing security headers, weak encryption
    - LOW: Security misconfigurations with limited impact, verbose error messages
  • Vulnerability Type: (e.g., "SQL Injection - CWE-89")
  • OWASP Category: (e.g., "A03:2021 - Injection")
  • Evidence: Specific vulnerable code with line numbers
  • Attack Scenario: How an attacker could exploit this
  • Impact: Data breach potential, system compromise, compliance violations

Step 3 - Security Remediation: Provide secure alternatives:
  • Specific secure code implementation
  • Reference to security libraries/frameworks (e.g., parameterized queries, ORM)
  • Defense-in-depth recommendations
  • Security testing suggestions

Step 4 - Security Verdict: Conclude with risk assessment:
  • Decision: BLOCK / FIX_REQUIRED / NEEDS_SECURITY_REVIEW / APPROVE_WITH_CONDITIONS
  • Risk Summary: Overall security posture assessment
  • Required Actions: Security fixes that must be implemented before deployment
</tasks>

<output_format>
Provide your security review in clear markdown format:

## 🔒 Security Analysis
[Your reasoning about security vulnerabilities - what attack vectors exist?]

## 🚨 Security Findings

### [SEVERITY] Vulnerability Type - CWE-XXX
**Lines:** [specific line numbers]
**OWASP Category:** [e.g., A03:2021 - Injection]
**Vulnerability:** [description of the security flaw]
**Attack Scenario:** [how an attacker exploits this]
**Impact:** [data breach, system compromise, compliance violation]
**Secure Fix:** [specific code solution with security best practices]

[Repeat for each vulnerability found]

## ✅ Security Verdict
**Risk Level:** [CRITICAL / HIGH / MEDIUM / LOW]
**Decision:** [BLOCK / FIX_REQUIRED / NEEDS_SECURITY_REVIEW / APPROVE_WITH_CONDITIONS]
**Summary:** [Overall security assessment]
**Required Security Actions:** [Must-fix items before deployment]
</output_format>
"""
    }
]

print("🔒 SECURITY-FOCUSED CODE REVIEW IN PROGRESS...")
print("="*70)
review_result = get_chat_completion(messages, temperature=0.0)
print(review_result)
print("="*70)
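
For comparison with whatever the model returns, the remediation a reviewer would typically expect for this diff is a parameterized query. The sketch below uses Python's built-in `sqlite3` with an in-memory database purely so it can run on its own; the table schema and the explicit `cursor` argument are assumptions, and the placeholder syntax differs between database drivers (`?` in `sqlite3`, `%s` in some others).

```python
import sqlite3


def get_user_by_email(cursor, email):
    # The ? placeholder passes the value separately from the SQL text,
    # so the input is treated as data and never parsed as SQL.
    cursor.execute("SELECT id, email FROM users WHERE email = ?", (email,))
    return cursor.fetchone()


# Hypothetical in-memory database used only for this demonstration.
conn = sqlite3.connect(":memory:")
cursor = conn.cursor()
cursor.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
cursor.execute("INSERT INTO users (email) VALUES (?)", ("alice@example.com",))

print(get_user_by_email(cursor, "alice@example.com"))  # (1, 'alice@example.com')
print(get_user_by_email(cursor, "' OR '1'='1"))        # None
```

With the original f-string version, the second input would turn the `WHERE` clause into a condition that is always true; with the placeholder it is just an email address that matches no row.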


### 🏋️ Activity: Build Your Own Code Review Template

<div style="background:#fef3c7; border-left:4px solid #f59e0b; padding:16px; border-radius:6px; margin:20px 0; color:#000000;">
<strong style="color:#92400e;">⏱️ Time Required: 35-50 minutes</strong><br>
This is a hands-on research and build activity. You'll explore professional code review patterns and create your own template.
</div>

#### 📖 What You'll Do

This activity challenges you to **research, design, and build** a production-ready code review template by studying real-world patterns from AWS.

#### 📋 Instructions

Follow the **3-step process** in the code cell below:

1. **RESEARCH (10-15 min)** - Study the AWS code review pattern and identify key elements
2. **DESIGN (10-15 min)** - Answer design questions to plan your template structure  
3. **BUILD (15-20 min)** - Implement your template by adapting the starter code



---

<div style="background:#f0f9ff; border-left:4px solid #3b82f6; padding:20px; border-radius:8px; margin:24px 0; color:#000000;">

### 📋 STEP 1 - RESEARCH (10-15 minutes)

**📖 READ THE AWS CODE REVIEW PATTERN:**
   
👉 [AWS Anthropic Code Review Pattern](https://github.com/aws-samples/anthropic-on-aws/blob/main/advanced-claude-code-patterns/commands/code-review.md)

**🔍 KEY THINGS TO LOOK FOR:**
- ✓ How do they structure code review prompts?
- ✓ What review dimensions do they cover? (Security, Performance, Quality, etc.)
- ✓ What severity levels do they use and how are they defined?
- ✓ What output format do they recommend?
- ✓ How do they ensure actionable feedback?

</div>

<div style="background:#fef3c7; border-left:4px solid #f59e0b; padding:20px; border-radius:8px; margin:24px 0; color:#000000;">

### 💭 STEP 2 - DESIGN YOUR TEMPLATE (10-15 minutes)

**ANSWER THESE QUESTIONS BEFORE CODING:**

**1️⃣  ROLE:** What expertise should the AI have?
   - 💡 *Hint: This is a Python authentication function - what type of engineer should review it?*

**2️⃣  CONTEXT:** What information helps the AI understand the code?
   - Repository and service name?
   - Purpose of the code change?
   - Technology stack specifics?
   - Security requirements?

**3️⃣  REVIEW DIMENSIONS:** What aspects should be evaluated?
   
   Consider the 6 dimensions from earlier in this notebook:
   - **Security** (SQL injection, password handling, input validation)
   - **Performance** (database queries, caching)
   - **Error Handling** (exceptions, edge cases)
   - **Code Quality** (readability, Python idioms, adherence to standards)
   - **Correctness** (authentication logic)
   - **Maintainability** (unnecessary complexity or dependencies)

**4️⃣  OUTPUT FORMAT:** How should findings be presented?
   - Markdown vs XML?
   - What sections are needed?
   - How to structure individual findings?
   - What makes feedback actionable?

</div>

<div style="background:#dcfce7; border-left:4px solid #22c55e; padding:20px; border-radius:8px; margin:24px 0; color:#000000;">

### 🔨 STEP 3 - BUILD YOUR TEMPLATE (15-20 minutes)

**YOUR TASK:**

⚠️ **Edit the starter template in the code cell below** by replacing all *TODO* sections with your own design based on your research in Steps 1 & 2.

The starter template provides the basic structure - you need to enhance it by:
1. Improving the role definition
2. Adding relevant context
3. Expanding review guidelines with specific checks
4. Structuring tasks with clear steps
5. Designing an effective output format

**💡 TIP:** Look at the complete template structure and the working example earlier in this section to see how all pieces fit together!

</div>

---


In [None]:
# ╔══════════════════════════════════════════════════════════════════════════════╗
# ║  PRACTICE ACTIVITY CODE - Follow Steps 1-3 in the markdown cell above        ║
# ╚══════════════════════════════════════════════════════════════════════════════╝

# ╔══════════════════════════════════════════════════════════════════════════════╗
# ║ CODE TO REVIEW: Python authentication function with multiple security issues ║
# ╚══════════════════════════════════════════════════════════════════════════════╝

practice_code = """
+ import hashlib
+ 
+ def authenticate_user(username, password):
+     # Connect to database
+     query = "SELECT * FROM users WHERE username = '" + username + "'"
+     user = db.execute(query)
+     
+     # Hash the password
+     hashed = hashlib.md5(password.encode()).hexdigest()
+     
+     # Check password
+     if user['password'] == hashed:
+         return user
+     return None
"""

#═══════════════════════════════════════════════════════════════════════════════
# ⚠️  STARTER TEMPLATE - EDIT ALL TODO SECTIONS BELOW ⚠️
#═══════════════════════════════════════════════════════════════════════════════
# This is a basic template to get you started. Your task is to enhance it by:
# 1. Improving the role definition
# 2. Adding relevant context
# 3. Expanding review guidelines with specific checks
# 4. Structuring tasks with clear steps
# 5. Designing an effective output format
#═══════════════════════════════════════════════════════════════════════════════

practice_messages = [
    {
        "role": "system",
        # ⚠️ TODO: Change this role based on what you learned from AWS patterns
        # 💡 Hint: What expertise is needed to review authentication code?
        #          Consider: "Security Engineer"? "Senior Backend Engineer"?
        "content": "You are a Senior Software Engineer."
    },
    {
        "role": "user", 
        "content": f"""
<context>
Repository: user-authentication-service
Service: Authentication API
Purpose: Add user login authentication endpoint

<!-- ⚠️ TODO #1: Add more context here based on AWS patterns
     Examples to consider:
     • Security requirements: OWASP compliance? PCI-DSS?
     • Technology stack: Python 3.x, PostgreSQL/MySQL, Flask/Django?
     • Authentication standards: OAuth 2.0? JWT tokens?
     • Deployment environment: AWS? On-premise?
-->
</context>

<code_diff>
{practice_code}
</code_diff>

<review_guidelines>
Evaluate the code across these critical dimensions:

1. **Security**: Check for vulnerabilities (SQL injection, weak hashing, input validation)
2. **Performance**: Identify scalability issues (database queries, caching opportunities)
3. **Error Handling**: Validate exception handling (try/except, edge cases)
4. **Code Quality**: Assess readability and maintainability
5. **Correctness**: Verify authentication logic works as intended
6. **Best Practices**: Check Python and security standards (OWASP guidelines)

<!-- ⚠️ TODO #2: Enhance these guidelines based on AWS code review patterns
     Reference: https://github.com/aws-samples/anthropic-on-aws/blob/main/advanced-claude-code-patterns/commands/code-review.md
     
     Consider adding:
     • Specific vulnerability types to check (e.g., "Check for CWE-89: SQL Injection")
     • Clear severity definitions (e.g., "CRITICAL: Remote code execution, data breach")
     • Evidence requirements (e.g., "Cite exact line numbers and code excerpts")
     • How to explain impact (e.g., "Explain attack vector and business impact")
     • Format for secure code examples (e.g., "Provide parameterized query alternative")
-->
</review_guidelines>

<tasks>
<!-- ⚠️ TODO #3: Define the review steps based on AWS code review patterns
     Think about:
     • Step 1 (Analysis): Should you request chain-of-thought reasoning?
     • Step 2 (Assessment): How should issues be categorized? (BLOCKER/MAJOR/MINOR/NIT?)
     • Step 3 (Recommendations): What makes fixes actionable? (code examples? alternatives?)
     • Step 4 (Verdict): What decision format? (Pass/Fail/Needs Discussion?)
-->

Step 1 - Analyze: Systematically examine code for issues across all dimensions
         [⚠️ Add specific analysis questions here - what should the LLM consider?]

Step 2 - Assess: For each issue found, provide:
         [⚠️ Define severity levels with concrete criteria]
         [⚠️ Specify what evidence is needed]
         [⚠️ Explain how to describe impact]

Step 3 - Recommend: Provide actionable fixes
         [⚠️ Define what makes recommendations actionable]

Step 4 - Verdict: Conclude with clear decision
         [⚠️ Specify decision format and required summary]
</tasks>

<output_format>
<!-- ⚠️ TODO #4: Design your output format based on AWS code review patterns
     Reference: Look at the template in Cell 9 or Cell 11 for inspiration
     
     Consider:
     • Markdown (recommended) or XML?
     • Main sections needed? (Analysis? Findings? Verdict?)
     • How should individual issues be structured?
     • What makes the output actionable and readable for engineers?
-->

Provide your review in clear format:

## [⚠️ Your Analysis Section Name]
[⚠️ What goes here? Think about chain-of-thought reasoning]

## [⚠️ Your Findings Section Name]
[⚠️ How are issues structured? What information is essential?]

### [SEVERITY] Issue Title
**Lines:** [⚠️ Specify what line information is needed]
**Problem:** [⚠️ Define how to explain the issue]
**Impact:** [⚠️ Define how to explain consequences]
**Fix:** [⚠️ Define how to provide recommendations]

## [⚠️ Your Verdict Section Name]
[⚠️ What final information helps decision-making?]
</output_format>
"""
    }
]

#═══════════════════════════════════════════════════════════════════════════════
# 🧪 TEST YOUR TEMPLATE - Uncomment when you've completed all TODO sections
#═══════════════════════════════════════════════════════════════════════════════

# print("🔍 TESTING YOUR CODE REVIEW TEMPLATE")
# print("="*70)
# result = get_chat_completion(practice_messages, temperature=0.0)
# print(result)
# print("="*70)

print("""
╔══════════════════════════════════════════════════════════════════════════════╗
║                           💡 HINTS FOR SUCCESS                               ║
╚══════════════════════════════════════════════════════════════════════════════╝

📋 CODE REVIEW ELEMENTS TO INCLUDE:
   ✓ Clear severity definitions: BLOCKER (security vulnerabilities), MAJOR, MINOR, NIT
   ✓ Evidence citations: Line numbers and specific code excerpts
   ✓ Impact explanation: Why the issue matters (security breach, data loss, etc.)
   ✓ Actionable recommendations: Specific code fixes with secure alternatives
   ✓ Reasoning transparency: Include analysis section showing thought process

⚠️  CRITICAL ISSUES IN THIS AUTHENTICATION CODE:
   • SQL Injection vulnerability (line 5 - string concatenation in query)
   • Weak password hashing (line 9 - MD5 is cryptographically broken)
   • Missing error handling (no try/except, no validation)
   • User enumeration risk (a missing user may raise an error while a wrong password returns None)
   • No input validation (username/password could be empty, malicious)
   • Missing security best practices (no rate limiting, no password complexity)

❓ SELF-CHECK QUESTIONS:
   → Does my prompt cover all 6 review dimensions?
   → Does it prioritize security issues appropriately for authentication code?
   → Does it request specific evidence (line numbers, vulnerable code excerpts)?
   → Does it ask for secure code examples (parameterized queries, bcrypt)?
   → Are severity levels well-defined with concrete security impact?
   → Does it use a clear, readable output format (markdown recommended)?

╔══════════════════════════════════════════════════════════════════════════════╗
║                            🎯 NEXT CHALLENGES                                ║
╚══════════════════════════════════════════════════════════════════════════════╝

After creating your template, extend your learning:
   
   1️⃣  Test it on different code samples (frontend, backend, different languages)
   2️⃣  Create specialized variants:
       • Security-only (reference: AWS security-scan.md pattern)
       • Performance-only (reference: AWS analyze-performance.md pattern)
   3️⃣  Compare with the complete examples in Cells 9 and 11
       • What did you do similarly? Differently?
       • Which approach works better for your use case?

📚 REFERENCE CELLS:
   • Cell 9: General code review template with markdown output
   • Cell 11: Security-focused review example with OWASP categories
   • Cell 14: The 3-step process for this practice activity
""")


---
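
To check whether a review's recommended fixes are on target, it helps to have a reference point for what a safer version might look like. The following is a minimal sketch, not the notebook's official solution. It assumes a SQLite database with a hypothetical `users(username, salt, password_hash)` schema and uses PBKDF2 from the standard library as a stand-in for bcrypt (which is a third-party package). The iteration count is an assumed work factor, not a recommendation from the source.

```python
import hashlib
import hmac
import os
import sqlite3

PBKDF2_ITERATIONS = 200_000  # assumed work factor; tune to your hardware and policy


def hash_password(password, salt):
    """Derive a slow, salted hash with PBKDF2-HMAC-SHA256."""
    return hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, PBKDF2_ITERATIONS)


def authenticate_user(conn, username, password):
    """Return the username on success, or None for any failure."""
    if not username or not password:
        return None
    # The ? placeholder keeps user input out of the SQL text.
    row = conn.execute(
        "SELECT salt, password_hash FROM users WHERE username = ?", (username,)
    ).fetchone()
    if row is None:
        return None  # same result as a wrong password
    salt, stored_hash = row
    candidate = hash_password(password, salt)
    # Constant-time comparison avoids leaking how many bytes matched.
    return username if hmac.compare_digest(candidate, stored_hash) else None


# Demo with an in-memory database (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (username TEXT PRIMARY KEY, salt BLOB, password_hash BLOB)")
salt = os.urandom(16)
conn.execute("INSERT INTO users VALUES (?, ?, ?)", ("alice", salt, hash_password("s3cret!", salt)))

print(authenticate_user(conn, "alice", "s3cret!"))      # valid credentials
print(authenticate_user(conn, "alice", "wrong"))        # wrong password
print(authenticate_user(conn, "' OR '1'='1", "x"))      # injection attempt is treated as a literal username
```

The sketch still returns early for unknown usernames, so response timing can differ between the two failure cases. Rate limiting and password policy are also out of scope here.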

### 📚 Learn More: Advanced Code Review Patterns

Want to dive deeper into production code review automation? Explore these resources:

**📖 AWS Anthropic Advanced Patterns**
- [Code Review Command Pattern](https://github.com/aws-samples/anthropic-on-aws/blob/main/advanced-claude-code-patterns/commands/code-review.md) - Production-ready patterns for AI-powered code review
- Covers advanced topics like multi-file reviews, security-focused analysis, and CI/CD integration

**🔗 Related Best Practices**
- [Claude 4 Prompt Engineering Best Practices](https://docs.claude.com/en/docs/build-with-claude/prompt-engineering/claude-4-best-practices) - Core prompting techniques
- [Prompt Templates and Variables](https://docs.claude.com/en/docs/build-with-claude/prompt-engineering/prompt-templates-and-variables) - Parameterization strategies

**💡 What You Can Build Next:**
- Integrate this template into your CI/CD pipeline (GitHub Actions, GitLab CI)
- Create specialized variants (security-only reviews, performance-only reviews)
- Build a review bot that automatically comments on pull requests
- Develop custom severity criteria tailored to your team's standards

<div style="margin-top:16px; color:#15803d; padding:12px; background:#dcfce7; border-radius:6px; border-left:4px solid #22c55e;">
<strong>🎯 Pro Tip:</strong> The AWS patterns repository includes examples of integrating these templates with AWS Lambda, CodeCommit, and other cloud services. Great for enterprise deployments!
</div>
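
As a starting point for the CI/CD idea above, a pipeline step could scan the model's markdown review for severity headings and fail the build when a blocking issue appears. This is a rough sketch under assumptions: it expects findings to use a `### [SEVERITY] Issue Title` heading (the shape suggested in the practice template), and the function names and the sample review text are hypothetical.

```python
import re
from collections import Counter

# Matches headings such as "### [BLOCKER] SQL injection in login query".
FINDING_HEADING = re.compile(r"^###\s*\[(BLOCKER|MAJOR|MINOR|NIT)\]\s*(.+)$", re.MULTILINE)


def count_findings(review_markdown):
    """Count findings per severity in a markdown review."""
    return Counter(severity for severity, _title in FINDING_HEADING.findall(review_markdown))


def ci_exit_code(review_markdown, blocking=("BLOCKER",)):
    """Return 1 if any blocking severity is present, otherwise 0."""
    counts = count_findings(review_markdown)
    return 1 if any(counts[level] > 0 for level in blocking) else 0


# Hypothetical model output used only to exercise the parser.
sample_review = """\
## Findings

### [BLOCKER] SQL injection in login query
**Lines:** 5

### [MINOR] Missing docstring
**Lines:** 3
"""

print(dict(count_findings(sample_review)))
print(ci_exit_code(sample_review))
```

In a real pipeline the return value would typically be passed to `sys.exit`. Because model output can drift from the requested format, treating "no headings found" as a separate case is worth considering.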

---

<div style="background: linear-gradient(135deg, #667eea 0%, #764ba2 100%); color: white; padding: 24px; border-radius: 12px; margin: 40px 0; box-shadow: 0 4px 6px rgba(0,0,0,0.1);">
  <div style="text-align: center; margin-bottom: 20px;">
    <h2 style="color: white; margin: 0; font-size: 1.8em; text-shadow: 2px 2px 4px rgba(0,0,0,0.3);">☕ Suggested Break Point #1</h2>
    <p style="margin: 8px 0; font-size: 1.1em; text-shadow: 1px 1px 2px rgba(0,0,0,0.3);">~40 minutes elapsed</p>
  </div>
  
  <div style="background: rgba(0,0,0,0.25); padding: 16px; border-radius: 8px; margin: 16px 0;">
    <p style="margin: 8px 0; font-size: 1.05em; font-weight: 600; text-shadow: 1px 1px 2px rgba(0,0,0,0.3);">✅ Completed:</p>
    <ul style="margin: 8px 0; padding-left: 24px; text-shadow: 1px 1px 2px rgba(0,0,0,0.2);">
      <li>Section 1: Code Review Automation Template</li>
      <li>Built production-ready code review prompts</li>
      <li>Practiced with security vulnerability detection</li>
      <li>Reviewed React component for performance issues</li>
    </ul>
  </div>
  
  <div style="background: rgba(0,0,0,0.25); padding: 16px; border-radius: 8px; margin: 16px 0;">
    <p style="margin: 8px 0; font-size: 1.05em; font-weight: 600; text-shadow: 1px 1px 2px rgba(0,0,0,0.3);">⏭️ Coming Next:</p>
    <ul style="margin: 8px 0; padding-left: 24px; text-shadow: 1px 1px 2px rgba(0,0,0,0.2);">
      <li>Section 2: Test Generation Automation Template</li>
      <li>Coverage gap identification</li>
      <li>Smart test plan creation</li>
    </ul>
    <p style="margin: 12px 0 0 0; font-size: 0.95em; text-shadow: 1px 1px 2px rgba(0,0,0,0.2);">⏱️ Next section: ~30-35 minutes</p>
  </div>
  
  <div style="background: rgba(255,255,255,0.95); padding: 14px; border-radius: 8px; margin: 16px 0; text-align: center; color: #1e293b;">
    <p style="margin: 0; font-weight: bold; font-size: 1.1em; color: #1e293b;">📌 BOOKMARK TO RESUME:</p>
    <p style="margin: 8px 0 0 0; font-size: 1.15em; font-weight: bold; color: #0f172a;">"Section 2: Test Generation Automation Template"</p>
  </div>
  
  <p style="text-align: center; margin: 16px 0 0 0; font-size: 0.9em; text-shadow: 1px 1px 2px rgba(0,0,0,0.3);">
    💡 <em>This is a natural stopping point. Feel free to take a break and return later!</em>
  </p>
</div>

---


## 🧪 Section 2: Test Generation Automation Template

### Building a Comprehensive Test Generation Prompt with Multi-Tactic Combination

<div style="background:#fef3c7; border-left:4px solid #f59e0b; padding:16px; border-radius:6px; margin:20px 0; color:#000000;">
<strong style="color:#92400e;">🎯 What You'll Build in This Section</strong><br><br>

You'll create a **production-ready test generation prompt template** that automatically produces comprehensive test suites by analyzing requirements and identifying coverage gaps. This isn't just about writing happy-path tests—you're building a system that uncovers edge cases, flags ambiguities, and produces actionable test specifications.

**Time Required:** ~40 minutes (includes building, testing, and refining the template)
</div>

#### 📋 Before You Start: What You'll Need

To get the most from this section, have ready:

1. **Requirements to test** (options):
   - A feature spec from your current sprint
   - User stories with acceptance criteria
   - Sample requirements provided in the activities below
   - Any vague or ambiguous requirements that need clarification

2. **Context about your test strategy**:
   - What test types does your team write? (unit, integration, E2E)
   - What test framework do you use? (pytest, Jest, JUnit)
   - What makes a good test specification in your workflow?

3. **Your API connection** set up and tested (from the setup section above)

<div style="background:#dbeafe; border-left:4px solid #3b82f6; padding:16px; border-radius:6px; margin:20px 0; color:#000000;">
<strong style="color:#1e40af;">💡 Why This Approach Works with Modern LLMs</strong><br><br>

This template follows industry best practices for prompt engineering with advanced language models. According to [Claude 4 prompt engineering best practices](https://docs.claude.com/en/docs/build-with-claude/prompt-engineering/claude-4-best-practices), modern LLMs excel when you:

- **Structure the analysis process** - We'll decompose test generation into clear steps: analyze requirements → identify gaps → generate specs → document infrastructure needs
- **Request explicit reasoning** - Chain-of-thought helps the model explain *why* certain edge cases matter (e.g., "Testing expiration at midnight requires timezone handling")
- **Use systematic frameworks** - Categorizing tests by type (unit/integration) and coverage dimension (happy path/edge case/error path) produces more thorough results
- **Flag ambiguities proactively** - Encouraging the model to question unclear requirements prevents wasted testing effort on wrong assumptions

These choices are not arbitrary: they reflect how advanced language models tend to follow instructions, which helps produce more thorough test coverage across different AI providers.
</div>
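
The midnight-expiration example mentioned above is easier to appreciate with a concrete case. The sketch below assumes one possible business rule, that a code stays valid through the end of its expiry date in the store's timezone. That rule, the fixed UTC-5 offset, and the `is_expired` helper are all illustrative assumptions, not requirements from this module.

```python
from datetime import date, datetime, time, timedelta, timezone


def is_expired(now, expires_on, store_tz):
    """Treat a code as valid through the end of `expires_on` in the store's timezone."""
    if now.tzinfo is None:
        raise ValueError("now must be timezone-aware")
    # First instant after the expiry date, expressed in the store's timezone.
    cutoff = datetime.combine(expires_on + timedelta(days=1), time.min, tzinfo=store_tz)
    return now >= cutoff


store_tz = timezone(timedelta(hours=-5))  # hypothetical fixed-offset store timezone
expires_on = date(2025, 3, 31)

one_second_before = datetime(2025, 3, 31, 23, 59, 59, tzinfo=store_tz)
exactly_midnight = datetime(2025, 4, 1, 0, 0, 0, tzinfo=store_tz)
same_instant_utc = datetime(2025, 4, 1, 5, 0, 0, tzinfo=timezone.utc)
utc_midnight = datetime(2025, 4, 1, 0, 0, 0, tzinfo=timezone.utc)

assert not is_expired(one_second_before, expires_on, store_tz)
assert is_expired(exactly_midnight, expires_on, store_tz)
assert is_expired(same_instant_utc, expires_on, store_tz)
assert not is_expired(utc_midnight, expires_on, store_tz)  # still 19:00 on Mar 31 in the store
```

The last assertion is the kind of case a happy-path test plan tends to miss: a server running on UTC would see April 1 while the store's day has not ended yet.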

#### 🎯 The Problem We're Solving

Manual test planning faces three critical challenges:

1. **📋 Incomplete Coverage**
   - Easy to miss edge cases and error paths
   - Boundary conditions often overlooked (0%, 100%, empty inputs)
   - Security and performance test scenarios forgotten
   - **Impact:** Bugs slip through to production, customer trust erodes

2. **⏰ Time Pressure**
   - Testing gets squeezed at the end of sprints
   - QA teams struggle to keep up with feature velocity
   - Test planning rushed, documentation minimal
   - **Impact:** Technical debt in test suites, maintenance nightmares

3. **🎲 Missed Ambiguities**
   - Unclear requirements don't get questioned until implementation
   - Assumptions made without validation
   - Integration points and dependencies discovered late
   - **Impact:** Rework, missed deadlines, scope creep

#### 🏗️ How We'll Build It: The Tactical Combination

| Tactic | Purpose | Implementation |
|--------|---------|----------------|
| **Role Prompting** | Assign QA expertise | "You are a QA Automation Lead with expertise in {{tech_stack}}" |
| **Structured Inputs** | Organize requirements & existing tests | XML tags: `<requirements>`, `<existing_tests>` |
| **Task Decomposition** | Break down test generation process | Numbered steps: Analyze → Identify Gaps → Generate Tests → Document Dependencies |
| **Chain-of-Thought** | Encourage reasoning about coverage | Request explicit analysis of gaps and ambiguities |
| **Structured Output** | Enable automation | Markdown format with sections for different test types |


### 📋 Test Generation Template Structure

<div style="background:#eff6ff; border-left:4px solid #3b82f6; padding:16px; border-radius:6px; margin:20px 0; color:#1e293b;">
<strong>🔨 Let's Build It</strong><br><br>

We'll construct this template by:
1. **Defining the QA role** with specific tech stack expertise
2. **Structuring inputs** using XML tags for requirements and existing test context
3. **Decomposing the task** into: Analyze → Identify Gaps → Generate Tests → Document Dependencies
4. **Requesting chain-of-thought** for coverage analysis
5. **Specifying markdown output** for test plans (replacing verbose XML with a readable format)
6. **Adding parameters** (`{{tech_stack}}`, `{{functional_requirements}}`, `{{test_suite_overview}}`) for reusability

This template draws inspiration from [AWS Anthropic test generation patterns](https://github.com/aws-samples/anthropic-on-aws/blob/main/advanced-claude-code-patterns/commands/generate-tests.md), adapted for clarity and automation.
</div>

```xml
<role>
You are a QA Automation Lead with expertise in {{tech_stack}}.
</role>

<requirements>
{{functional_requirements}}
</requirements>

<existing_tests>
{{test_suite_overview}}
</existing_tests>

<tasks>
1. Analyze the requirements and existing test coverage
2. Identify coverage gaps across these dimensions:
   - Missing scenarios (happy paths, edge cases, error paths)
   - Business rule validation
   - Data boundary conditions
   - Concurrent/async behavior
   - Security concerns (auth, input validation)
   - Performance considerations

3. For each identified gap, generate test specifications including:
   - Test name (descriptive, follows naming conventions)
   - Purpose (what does this test verify?)
   - Test type (unit, integration, e2e)
   - Preconditions (required setup, test data, mocks)
   - Steps (execution sequence)
   - Expected outcome (assertions, success criteria)

4. Categorize tests by type and document dependencies

5. Flag ambiguities in requirements that need clarification
</tasks>

<output_format>
Provide your test plan in clear markdown format:

## 🔍 Analysis
[Your reasoning about requirements and existing coverage - what patterns do you see?]

## ⚠️ Ambiguities
[Requirements that need clarification before testing]

## 📊 Coverage Gaps
[What's missing from current test suite?]

## 🧪 Unit Tests

### Test: [Descriptive Name]
**Purpose:** [What this test verifies]
**Preconditions:** [Required setup]
**Steps:**
1. [Action]
2. [Action]
**Expected:** [Success criteria]

[Repeat for each unit test]

## 🔗 Integration Tests

### Test: [Descriptive Name]
**Purpose:** [What this test verifies]
**Preconditions:** [Required setup]
**Steps:**
1. [Action]
2. [Action]
**Expected:** [Success criteria]

[Repeat for each integration test]

## 🛠️ Test Infrastructure Needs
[Mocks, fixtures, test data, environment dependencies]
</output_format>
```
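
The `{{...}}` placeholders in the template need to be filled in before the prompt is sent. A minimal sketch of one way to do that follows. The `render_template` helper is hypothetical (the notebook's own example uses f-strings instead), and the shortened template here is only for illustration.

```python
import re

PLACEHOLDER = re.compile(r"\{\{(\w+)\}\}")


def render_template(template, **params):
    """Replace {{name}} placeholders and fail loudly if a value is missing."""
    missing = sorted(set(PLACEHOLDER.findall(template)) - set(params))
    if missing:
        raise KeyError(f"Missing template parameters: {', '.join(missing)}")
    return PLACEHOLDER.sub(lambda match: str(params[match.group(1)]), template)


# Shortened version of the template above, for illustration only.
template = """<role>
You are a QA Automation Lead with expertise in {{tech_stack}}.
</role>

<requirements>
{{functional_requirements}}
</requirements>"""

prompt = render_template(
    template,
    tech_stack="Python (pytest)",
    functional_requirements="1. Process credit card payments with validation",
)
print(prompt)
```

A regex-based substitution is used here instead of `str.format` so that single braces in requirements or code snippets do not need escaping, and a missing parameter raises an error rather than silently leaving `{{name}}` in the prompt.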


### 💻 Working Example: Payment Service Test Generation

Let's generate comprehensive tests for a payment processing service.


In [None]:
# Example: Test Case Generation for Payment Service

functional_requirements = """
Payment Processing Requirements:
1. Process credit card payments with validation
2. Handle multiple currencies (USD, EUR, GBP)
3. Apply discounts and calculate tax
4. Generate transaction receipts
5. Handle payment failures and retries (max 3 attempts)
6. Send confirmation emails on success
7. Log all transactions for audit compliance
8. Support payment refunds within 30 days
"""

existing_tests = """
Current Test Suite (payment_service_test.py):
- test_process_valid_payment() - Happy path for USD payments
- test_invalid_card_number() - Validates card number format
- test_calculate_tax() - Tax calculation for US region only
"""

test_messages = [
    {
        "role": "system",
        "content": "You are a QA Automation Lead with expertise in Python testing frameworks (pytest)."
    },
    {
        "role": "user",
        "content": f"""
<requirements>
{functional_requirements}
</requirements>

<existing_tests>
{existing_tests}
</existing_tests>

<tasks>
1. Analyze the requirements and existing test coverage
2. Identify coverage gaps across these dimensions:
   - Missing scenarios (happy paths, edge cases, error paths)
   - Business rule validation
   - Data boundary conditions
   - Concurrent/async behavior
   - Security concerns (auth, input validation)
   - Performance considerations

3. For each identified gap, generate test specifications including:
   - Test name (descriptive, follows naming conventions)
   - Purpose (what does this test verify?)
   - Test type (unit, integration, e2e)
   - Preconditions (required setup, test data, mocks)
   - Steps (execution sequence)
   - Expected outcome (assertions, success criteria)

4. Categorize tests by type and document dependencies

5. Flag ambiguities in requirements that need clarification
</tasks>

<output_format>
Provide your test plan in clear markdown format:

## 🔍 Analysis
[Your reasoning about requirements and existing coverage - what patterns do you see?]

## ⚠️ Ambiguities
[Requirements that need clarification before testing]

## 📊 Coverage Gaps
[What's missing from current test suite?]

## 🧪 Unit Tests

### Test: [Descriptive Name]
**Purpose:** [What this test verifies]
**Preconditions:** [Required setup]
**Steps:**
1. [Action]
2. [Action]
**Expected:** [Success criteria]

[Repeat for each unit test]

## 🔗 Integration Tests

### Test: [Descriptive Name]
**Purpose:** [What this test verifies]
**Preconditions:** [Required setup]
**Steps:**
1. [Action]
2. [Action]
**Expected:** [Success criteria]

[Repeat for each integration test]

## 🛠️ Test Infrastructure Needs
[Mocks, fixtures, test data, environment dependencies]
</output_format>
"""
    }
]

print("🧪 TEST GENERATION IN PROGRESS...")
print("="*70)
test_result = get_chat_completion(test_messages, temperature=0.0)
print(test_result)
print("="*70)


### 🏋️ Practice Activity: Build Your Own Test Generation Template

<div style="background:#fef3c7; border-left:4px solid #f59e0b; padding:16px; border-radius:6px; margin:20px 0; color:#000000;">
<strong style="color:#92400e;">⏱️ Time Required: 35-50 minutes</strong><br>
This is a hands-on research and build activity. You'll explore professional test generation patterns and create your own template.
</div>

#### 📖 What You'll Do

This activity challenges you to **research, design, and build** a production-ready test generation template by studying real-world patterns from AWS. You'll work with ambiguous requirements for a shopping cart discount system - a perfect scenario for showcasing comprehensive test planning.

#### 🎯 Learning Objectives

By completing this activity, you will:
- ✅ Learn how to research and adapt professional test generation patterns
- ✅ Understand how to identify coverage gaps and ambiguities in requirements
- ✅ Practice designing structured test plans with unit and integration tests
- ✅ Build a reusable template for automated test case generation

#### 📋 The Scenario

A product manager has provided vague requirements for a new feature:

**Feature: Shopping Cart Discount System**
- Users can apply discount codes at checkout
- Some discounts are percentage-based, others are fixed amounts
- Discounts have expiration dates
- Some codes are one-time use, others unlimited
- Discounts can't be combined

**Existing Test Coverage:**
```python
# Current test suite:
#   test_apply_percentage_discount()  -> 10% off $100 cart
#   test_fixed_amount_discount()      -> $5 off $50 cart
```

**Your Challenge:** These requirements are intentionally vague! Your template should identify ambiguities, generate edge cases, and produce comprehensive test specifications.

#### 🔍 Requirements Sample for Testing

The practice code cell further below contains the discount system requirements and the minimal existing test coverage. Use them as the input while building your template.


---

<div style="background:#f0f9ff; border-left:4px solid #3b82f6; padding:20px; border-radius:8px; margin:24px 0; color:#000000;">

### 📋 STEP 1 - RESEARCH (10-15 minutes)

**📖 READ THE AWS TEST GENERATION PATTERN:**
   
👉 [AWS Anthropic Test Generation Pattern](https://github.com/aws-samples/anthropic-on-aws/blob/main/advanced-claude-code-patterns/commands/generate-tests.md)

**🔍 KEY THINGS TO LOOK FOR:**
- ✓ How do they structure test generation prompts?
- ✓ What dimensions do they analyze? (happy paths, edge cases, error paths)
- ✓ How do they handle ambiguous requirements?
- ✓ What output format do they recommend for test specifications?
- ✓ How do they categorize tests (unit vs integration)?

</div>

<div style="background:#fef3c7; border-left:4px solid #f59e0b; padding:20px; border-radius:8px; margin:24px 0; color:#000000;">

### 💭 STEP 2 - DESIGN YOUR TEMPLATE (10-15 minutes)

**ANSWER THESE QUESTIONS BEFORE CODING:**

**1️⃣  ROLE:** What expertise should the AI have?
   - 💡 *Hint: This is an e-commerce discount system - what type of QA engineer should test it?*

**2️⃣  INPUTS:** What information helps the AI generate comprehensive tests?
   - Requirements document (the vague feature description)?
   - Existing test coverage (what's already tested)?
   - Tech stack context (Python/pytest, JavaScript/Jest)?
   - Business rules to validate?

**3️⃣  COVERAGE DIMENSIONS:** What aspects should be tested?
   
   Consider these test categories:
   - **Happy Paths** (valid discount codes, successful applications)
   - **Edge Cases** (boundary values: 0%, 100% discounts, $0.01 amounts)
   - **Error Paths** (expired codes, invalid codes, already-used one-time codes)
   - **Business Rules** (no combination, minimum cart value requirements)
   - **Ambiguities** (What if discount > cart total? Case sensitivity?)

**4️⃣  OUTPUT FORMAT:** How should test specifications be structured?
   - Markdown vs XML?
   - What fields per test? (name, purpose, preconditions, steps, expected)
   - How to separate unit vs integration tests?
   - How to flag ambiguities and infrastructure needs?

</div>
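
To see what the edge-case dimension looks like once it reaches code, here is a small sketch of boundary checks for a discount calculation. It is not the feature's real implementation: `apply_discount` is a hypothetical function, and its rules (percentages limited to 0-100, fixed discounts above the cart total rejected, a final total of exactly zero allowed) are assumptions that a real test plan should flag as ambiguities to confirm.

```python
from decimal import Decimal, ROUND_HALF_UP

CENT = Decimal("0.01")


def apply_discount(cart_total, kind, value):
    """Return the cart total after one discount, rounded to cents.

    Assumed rules: percentages must lie in [0, 100], and a fixed discount
    larger than the cart total is rejected.
    """
    cart_total, value = Decimal(cart_total), Decimal(value)
    if kind == "percentage":
        if not Decimal("0") <= value <= Decimal("100"):
            raise ValueError("percentage must be between 0 and 100")
        discounted = cart_total * (Decimal("100") - value) / Decimal("100")
    elif kind == "fixed":
        if value < 0 or value > cart_total:
            raise ValueError("fixed discount must be between 0 and the cart total")
        discounted = cart_total - value
    else:
        raise ValueError(f"unknown discount kind: {kind}")
    return discounted.quantize(CENT, rounding=ROUND_HALF_UP)


# Boundary checks drawn from the coverage dimensions above.
assert apply_discount("100.00", "percentage", "0") == Decimal("100.00")
assert apply_discount("100.00", "percentage", "100") == Decimal("0.00")
assert apply_discount("0.01", "fixed", "0.01") == Decimal("0.00")
assert apply_discount("9.99", "percentage", "15") == Decimal("8.49")  # 8.4915 rounds down

for bad_kind, bad_value in [("percentage", "101"), ("fixed", "50.01")]:
    try:
        apply_discount("50.00", bad_kind, bad_value)
    except ValueError:
        pass
    else:
        raise AssertionError("expected ValueError")
```

`Decimal` is used instead of `float` so that currency arithmetic and rounding are exact. Expiration, one-time use, and the no-combination rule involve state and are left out of this sketch.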

<div style="background:#dcfce7; border-left:4px solid #22c55e; padding:20px; border-radius:8px; margin:24px 0; color:#000000;">

### 🔨 STEP 3 - BUILD YOUR TEMPLATE (15-20 minutes)

**YOUR TASK:**

⚠️ **Edit the starter template in the code cell below** by replacing all `TODO` sections with your own design based on your research in Steps 1 & 2.

The starter template provides the basic structure - you need to enhance it by:
1. Improving the QA role definition
2. Adding complete requirements and existing test context
3. Expanding task steps for comprehensive coverage analysis
4. Designing an effective output format for test specifications

**💡 TIP:** Look at the complete example in Cell 20 to see how all pieces fit together!

</div>

---



**📖 Full Solution Reference:** 

After completing your template, you can compare your approach with [solutions/activity-3.3-test-generation-solution.md](solutions/activity-3.3-test-generation-solution.md) to see:
- A complete test generation template implementation
- How to identify ambiguities in requirements systematically
- Examples of comprehensive edge case coverage
- Sprint planning and TDD workflow integration

<div style="margin-top:16px; color:#15803d; padding:12px; background:#dcfce7; border-radius:6px; border-left:4px solid #22c55e;">
<strong>💡 Remember:</strong> There's no single "correct" solution. The goal is to build a template that works for your specific testing needs. Focus on understanding the <em>why</em> behind each design decision rather than matching the solution exactly.
</div>

---


In [None]:
# ╔══════════════════════════════════════════════════════════════════════════════╗
# ║  PRACTICE ACTIVITY CODE - Follow Steps 1-3 in the markdown cell above        ║
# ╚══════════════════════════════════════════════════════════════════════════════╝

#═══════════════════════════════════════════════════════════════════════════════
# REQUIREMENTS: Shopping Cart Discount System (intentionally vague!)
#═══════════════════════════════════════════════════════════════════════════════

discount_requirements = """
Feature: Shopping Cart Discount System

Requirements:
1. Users can apply discount codes at checkout
2. Discount types: percentage (10%, 25%, etc.) or fixed amount ($5, $20, etc.)
3. Each discount code has an expiration date
4. Usage limits: one-time use OR unlimited
5. Business rule: Discounts cannot be combined (one per order)
6. Cart total must be > 0 after discount applied
7. Fixed discounts cannot exceed cart total
"""

existing_discount_tests = """
Current test suite (minimal coverage):
- test_apply_percentage_discount() - 10% off $100 cart
- test_fixed_amount_discount() - $5 off $50 cart
"""

#═══════════════════════════════════════════════════════════════════════════════
# ⚠️  STARTER TEMPLATE - EDIT ALL TODO SECTIONS BELOW ⚠️
#═══════════════════════════════════════════════════════════════════════════════
# This template provides basic structure. Your task is to enhance it by:
# 1. Refining the QA role definition
# 2. Structuring comprehensive task steps
# 3. Designing an effective output format
# 4. Adding coverage dimensions and ambiguity detection
#═══════════════════════════════════════════════════════════════════════════════

discount_test_messages = [
    {
        "role": "system",
        # ⚠️ TODO: Refine this role based on what you learned from AWS patterns
        # 💡 Hint: What specific QA expertise is needed for e-commerce testing?
        #          Consider: "QA Automation Lead specializing in..."?
        "content": "You are a QA Automation Lead specializing in e-commerce testing."
    },
    {
        "role": "user",
        "content": f"""
<requirements>
{discount_requirements}

<!-- ⚠️ TODO #1: Should you add more context here?
     Examples to consider:
     • Tech stack: Python/pytest? JavaScript/Jest?
     • Business context: B2C e-commerce platform
     • Compliance requirements: PCI-DSS? Regional pricing laws?
     • Performance expectations: Must handle X transactions/sec?
-->
</requirements>

<existing_tests>
{existing_discount_tests}

<!-- ⚠️ TODO #2: What additional test context would be helpful?
     • Test framework details?
     • Current coverage percentage?
     • Known gaps in testing infrastructure?
-->
</existing_tests>

<tasks>
<!-- ⚠️ TODO #3: Design comprehensive task steps based on AWS test generation patterns
     Reference: https://github.com/aws-samples/anthropic-on-aws/blob/main/advanced-claude-code-patterns/commands/generate-tests.md
     
     Consider these steps:
     • Step 1: Analyze requirements and identify ambiguities
     • Step 2: List coverage gaps across dimensions (happy/edge/error paths)
     • Step 3: Generate test specifications with clear purpose
     • Step 4: Categorize by test type (unit vs integration)
     • Step 5: Flag infrastructure needs (mocks, fixtures, test data)
-->

1. Analyze requirements and identify ambiguities or missing specifications
   [⚠️ Add guiding questions: What constitutes an ambiguity? How to flag them?]

2. List coverage gaps in existing tests
   [⚠️ Define dimensions: happy paths, edge cases, error paths, business rules]

3. Generate comprehensive test cases
   [⚠️ Specify what each test needs: name format, purpose, preconditions, steps, expected]

4. Categorize tests by type
   [⚠️ Define criteria: What makes a test "unit" vs "integration"?]

5. Document test infrastructure needs
   [⚠️ What should be flagged: mocks, fixtures, environment dependencies?]
</tasks>

<output_format>
<!-- ⚠️ TODO #4: Design your output format based on AWS test generation patterns
     Reference: Look at Cell 20 for the markdown format example
     
     Consider:
     • Markdown (recommended) or XML?
     • Main sections: Analysis? Ambiguities? Coverage Gaps? Test Specs?
     • Test structure: What fields per test? (name, purpose, preconditions, steps, expected)
     • How to separate unit vs integration tests?
     • How to document infrastructure needs?
-->

Provide your test plan in a clear format:

## [⚠️ Your Analysis Section Name]
[⚠️ What goes here? Think about requirement analysis and ambiguity detection]

## [⚠️ Your Ambiguities Section Name]
[⚠️ How should unclear requirements be flagged?]

## [⚠️ Your Coverage Gaps Section Name]
[⚠️ What information helps identify what's missing?]

## [⚠️ Your Unit Tests Section Name]
[⚠️ How should individual unit tests be structured?]

### Test: [Descriptive Name]
**Purpose:** [⚠️ Define what makes a good purpose statement]
**Preconditions:** [⚠️ What setup information is needed?]
**Steps:** [⚠️ How detailed should steps be?]
**Expected:** [⚠️ What makes expectations clear and testable?]

## [⚠️ Your Integration Tests Section Name]
[⚠️ Similar structure to unit tests, but what distinguishes integration tests?]

## [⚠️ Your Infrastructure Needs Section Name]
[⚠️ What test dependencies should be documented?]
</output_format>
"""
    }
]

#═══════════════════════════════════════════════════════════════════════════════
# 🧪 TEST YOUR TEMPLATE - Uncomment when you've completed all TODO sections
#═══════════════════════════════════════════════════════════════════════════════

# print("🧪 TESTING YOUR TEST GENERATION TEMPLATE")
# print("="*70)
# discount_test_result = get_chat_completion(discount_test_messages, temperature=0.0)
# print(discount_test_result)
# print("="*70)

#═══════════════════════════════════════════════════════════════════════════════
# HINTS & GUIDANCE
#═══════════════════════════════════════════════════════════════════════════════

print("""
╔══════════════════════════════════════════════════════════════════════════════╗
║                           💡 HINTS FOR SUCCESS                               ║
╚══════════════════════════════════════════════════════════════════════════════╝

📋 TEST GENERATION ELEMENTS TO INCLUDE:
   ✓ Ambiguity detection: Identify unclear or missing requirements
   ✓ Coverage dimensions: Happy paths, edge cases, error paths, business rules
   ✓ Test categorization: Clear distinction between unit and integration tests
   ✓ Comprehensive specs: Purpose, preconditions, steps, expected outcomes
   ✓ Infrastructure flagging: Mocks, fixtures, test data requirements

🤔 AMBIGUITIES TO IDENTIFY IN THIS DISCOUNT SYSTEM:
   • What happens if discount code is expired? (error message? silent fail?)
   • Are discount codes case-sensitive? (SAVE10 vs save10)
   • What if fixed discount > cart total? (set to $0? reject?)
   • Can percentage be 0%? 100%? Over 100%?
   • How are percentages rounded? (0.5 rounds up or down?)
   • Race condition: Multiple users applying one-time-use code simultaneously?
   • Minimum cart value requirement before discount?
   • What if cart is empty when discount is applied?

🧪 EDGE CASES TO COVER:
   • Boundary values: 0%, 1%, 99%, 100% discounts
   • Minimum amounts: $0.01 cart, $0.01 discount
   • Maximum amounts: Very large cart values, very large discounts
   • Expiration: Code expires today (time zone handling?)
   • Usage limits: Exactly at usage limit vs over limit
   • Empty/null/invalid inputs: Missing codes, special characters

❓ SELF-CHECK QUESTIONS:
   → Does my template request ambiguity identification?
   → Does it cover all test dimensions (happy/edge/error/business)?
   → Are test specifications detailed enough to implement?
   → Is the output format clear and actionable for developers?
   → Does it flag infrastructure needs (mock time for expiration tests)?

╔══════════════════════════════════════════════════════════════════════════════╗
║                            🎯 NEXT CHALLENGES                                ║
╚══════════════════════════════════════════════════════════════════════════════╝

After creating your template, extend your learning:
   
   1️⃣  Test it on different features (user authentication, payment processing)
   2️⃣  Create specialized variants:
       • API testing template (focus on contracts, versioning)
       • Security testing template (focus on auth, input validation)
   3️⃣  Compare with the complete example in Cell 20
       • What did you do similarly? Differently?
       • Which approach generates more comprehensive tests?

📚 REFERENCE CELLS:
   • Cell 18: General test generation template with markdown output
   • Cell 20: Complete working example with payment service
   • Cell 33: The 3-step process for this practice activity
""")


---
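
The boundary values listed in the hints (0%, 1%, 99%, 100%, and a $0.01 cart) can be turned into a small table-driven check. The sketch below is only an illustration: `apply_percentage_discount` is a hypothetical function, and both the "reject anything outside 0-100" rule and the round-half-up rule are assumptions, since the hints flag exactly these points as ambiguities to resolve with the requirement owner.

```python
from decimal import Decimal, ROUND_HALF_UP


def apply_percentage_discount(cart_total: Decimal, percent: Decimal) -> Decimal:
    """Hypothetical discount function (not from the module's requirements).

    Assumptions: percent must be within 0-100, totals are rounded
    half up to whole cents.
    """
    if cart_total < 0:
        raise ValueError("cart_total must not be negative")
    if not Decimal("0") <= percent <= Decimal("100"):
        raise ValueError("percent must be between 0 and 100")
    discounted = cart_total * (Decimal("100") - percent) / Decimal("100")
    return discounted.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


# (cart total, discount percent, expected total) for each boundary value
BOUNDARY_CASES = [
    (Decimal("50.00"), Decimal("0"), Decimal("50.00")),
    (Decimal("50.00"), Decimal("1"), Decimal("49.50")),
    (Decimal("50.00"), Decimal("99"), Decimal("0.50")),
    (Decimal("50.00"), Decimal("100"), Decimal("0.00")),
    (Decimal("0.01"), Decimal("50"), Decimal("0.01")),  # 0.005 rounds half up
]


def run_boundary_cases():
    """Return a list of (total, percent, expected, actual) for failing cases."""
    failures = []
    for total, percent, expected in BOUNDARY_CASES:
        actual = apply_percentage_discount(total, percent)
        if actual != expected:
            failures.append((total, percent, expected, actual))
    return failures


failures = run_boundary_cases()
print("all boundary cases pass" if not failures else failures)
```

If your team uses pytest, the same table maps naturally onto a parametrized test. Using `Decimal` rather than `float` keeps the rounding behavior explicit, which is one of the ambiguities the template should surface.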

### 📚 Learn More: Advanced Test Generation Patterns

Want to dive deeper into AI-powered test automation? Explore these resources:

**📖 AWS Anthropic Advanced Patterns**
- [Test Generation Command Pattern](https://github.com/aws-samples/anthropic-on-aws/blob/main/advanced-claude-code-patterns/commands/generate-tests.md) - Production-ready patterns for automated test generation
- Covers advanced topics like test data generation, coverage analysis, and CI/CD integration

**🔗 Related Best Practices**
- [Claude 4 Prompt Engineering Best Practices](https://docs.claude.com/en/docs/build-with-claude/prompt-engineering/claude-4-best-practices) - Core prompting techniques
- [Prompt Templates and Variables](https://docs.claude.com/en/docs/build-with-claude/prompt-engineering/prompt-templates-and-variables) - Parameterization strategies

**💡 What You Can Build Next:**
- Integrate this template into your CI/CD pipeline for automatic test generation
- Create specialized variants (API testing, UI testing, security testing)
- Build a test coverage analyzer that suggests missing test scenarios
- Develop test data generators for edge case validation

<div style="margin-top:16px; color:#15803d; padding:12px; background:#dcfce7; border-radius:6px; border-left:4px solid #22c55e;">
<strong>🎯 Pro Tip:</strong> The AWS patterns repository includes examples of integrating test generation with AWS Lambda and CodeBuild. Perfect for automating test creation in your deployment pipeline!
</div>

---

<div style="background: linear-gradient(135deg, #f093fb 0%, #f5576c 100%); color: white; padding: 24px; border-radius: 12px; margin: 40px 0; box-shadow: 0 4px 6px rgba(0,0,0,0.1);">
  <div style="text-align: center; margin-bottom: 20px;">
    <h2 style="color: white; margin: 0; font-size: 1.8em; text-shadow: 2px 2px 4px rgba(0,0,0,0.3);">🍵 Suggested Break Point #2</h2>
    <p style="margin: 8px 0; font-size: 1.1em; text-shadow: 1px 1px 2px rgba(0,0,0,0.3);">~75 minutes elapsed • Halfway through!</p>
  </div>
  
  <div style="background: rgba(0,0,0,0.25); padding: 16px; border-radius: 8px; margin: 16px 0;">
    <p style="margin: 8px 0; font-size: 1.05em; font-weight: 600; text-shadow: 1px 1px 2px rgba(0,0,0,0.3);">✅ Completed (Sections 1-2):</p>
    <ul style="margin: 8px 0; padding-left: 24px; font-size: 0.95em; text-shadow: 1px 1px 2px rgba(0,0,0,0.2);">
      <li>Code Review Automation Template</li>
      <li>Test Generation Automation Template</li>
      <li>Production-ready template structures with markdown output</li>
      <li>Hands-on practice building code review and test generation templates</li>
    </ul>
    <p style="margin: 12px 0 0 0; font-size: 0.95em; text-shadow: 1px 1px 2px rgba(0,0,0,0.2);">🎯 You've completed 2 out of 4 sections!</p>
  </div>
  
  <div style="background: rgba(0,0,0,0.25); padding: 16px; border-radius: 8px; margin: 16px 0;">
    <p style="margin: 8px 0; font-size: 1.05em; font-weight: 600; text-shadow: 1px 1px 2px rgba(0,0,0,0.3);">⏭️ Coming Next:</p>
    <ul style="margin: 8px 0; padding-left: 24px; text-shadow: 1px 1px 2px rgba(0,0,0,0.2);">
      <li>Section 3: LLM-as-Judge Evaluation Rubric</li>
      <li>Validating AI-generated outputs</li>
      <li>Quality gates and automated QA</li>
    </ul>
    <p style="margin: 12px 0 0 0; font-size: 0.95em; text-shadow: 1px 1px 2px rgba(0,0,0,0.2);">⏱️ Next section: ~25-30 minutes</p>
  </div>
  
  <div style="background: rgba(255,255,255,0.95); padding: 14px; border-radius: 8px; margin: 16px 0; text-align: center; color: #1e293b;">
    <p style="margin: 0; font-weight: bold; font-size: 1.1em; color: #1e293b;">📌 BOOKMARK TO RESUME:</p>
    <p style="margin: 8px 0 0 0; font-size: 1.15em; font-weight: bold; color: #0f172a;">"Section 3: LLM-as-Judge Evaluation Rubric"</p>
  </div>
  
  <p style="text-align: center; margin: 16px 0 0 0; font-size: 0.9em; text-shadow: 1px 1px 2px rgba(0,0,0,0.3);">
    💡 <em>Great progress! Consider taking a break before continuing with quality assurance.</em>
  </p>
</div>

---


## ⚖️ Section 3: LLM-as-Judge Evaluation Rubric

### Validating AI-Generated Outputs

**The Quality Challenge:**

When AI generates code reviews or test plans, how do you know if they're good?

- ❓ **Trust issue** - Can we rely on AI feedback?
- 📊 **Consistency** - Does quality vary between runs?
- 🎯 **Standards** - Does output meet team expectations?

**Solution: LLM-as-Judge Pattern**

Use a second AI call with a structured rubric to evaluate the first AI's output. Think of it as automated peer review!

#### 🔄 The Workflow

```
1. AI Generator → Produces code review / test plan
2. LLM-as-Judge → Evaluates quality against rubric  
3. Decision → Accept / Request revision / Reject
```

**Benefits:**
- ✅ **Automated QA** - Not every AI output needs human review
- 📊 **Objective scoring** - Rubric provides consistent evaluation
- 🔍 **Transparency** - Shows why output passed or failed
- 🔄 **Feedback loop** - Low scores trigger regeneration with improvements


### 📋 LLM-as-Judge Rubric Template

```xml
<role>
You are a Principal Engineer reviewing AI-generated code feedback.
</role>

<rubric>
1. Accuracy (40%): Do the identified issues/tests align with the actual code/requirements?
2. Completeness (30%): Are major concerns covered? Are tests covering edge cases?
3. Actionability (20%): Are remediation steps clear and feasible?
4. Communication (10%): Is tone professional and structure clear?
</rubric>

<instructions>
Score each criterion 1-5 with detailed rationale:
- 5: Excellent - Exceeds expectations
- 4: Good - Meets expectations with minor gaps
- 3: Acceptable - Meets minimum bar
- 2: Needs work - Significant gaps
- 1: Unacceptable - Fails to meet standards

Calculate weighted total score.
Recommend:
- ACCEPT (≥3.5): Production-ready
- REVISE (≥2.5 and <3.5): Needs improvements, provide specific guidance
- REJECT (<2.5): Start over with different approach
</instructions>

<submission>
{{llm_output_under_review}}
</submission>

<output_format>
<evaluation>
 <scores>
   <criterion name="Accuracy" weight="40%">
     <score></score>
     <rationale></rationale>
   </criterion>
   <!-- ... other criteria ... -->
 </scores>
 <weighted_total></weighted_total>
 <recommendation>ACCEPT/REVISE/REJECT</recommendation>
 <feedback></feedback>
</evaluation>
</output_format>
```
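
If the judge follows the `<output_format>` above, its reply can be parsed with the standard library. The following is a minimal sketch, assuming the model returns well-formed XML with nothing around it; in practice you may need to strip surrounding prose or handle unescaped characters first. `SAMPLE_JUDGE_OUTPUT` is made-up data for illustration, and `parse_evaluation` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Made-up judge reply in the template's output format (illustration only)
SAMPLE_JUDGE_OUTPUT = """
<evaluation>
 <scores>
   <criterion name="Accuracy" weight="40%">
     <score>4</score>
     <rationale>Identified issues match the code.</rationale>
   </criterion>
   <criterion name="Completeness" weight="30%">
     <score>3</score>
     <rationale>Floating point precision was not discussed.</rationale>
   </criterion>
   <criterion name="Actionability" weight="20%">
     <score>5</score>
     <rationale>Each issue has a concrete fix.</rationale>
   </criterion>
   <criterion name="Communication" weight="10%">
     <score>4</score>
     <rationale>Clear and professional.</rationale>
   </criterion>
 </scores>
 <weighted_total>3.9</weighted_total>
 <recommendation>ACCEPT</recommendation>
 <feedback>Consider mentioning rounding behavior.</feedback>
</evaluation>
"""


def parse_evaluation(xml_text: str) -> dict:
    """Extract per-criterion scores, the reported total, and the verdict."""
    root = ET.fromstring(xml_text.strip())
    scores = {}
    for criterion in root.iter("criterion"):
        scores[criterion.get("name")] = {
            # "40%" -> 0.4
            "weight": float(criterion.get("weight").rstrip("%")) / 100,
            "score": int(criterion.findtext("score").strip()),
        }
    return {
        "scores": scores,
        "reported_total": float(root.findtext("weighted_total").strip()),
        "recommendation": root.findtext("recommendation").strip(),
        "feedback": (root.findtext("feedback") or "").strip(),
    }


parsed = parse_evaluation(SAMPLE_JUDGE_OUTPUT)
print(parsed["recommendation"], parsed["reported_total"])
```

Keeping the weights and scores as numbers makes it easy to recompute the weighted total yourself instead of trusting the model's arithmetic.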

#### 🔑 Rubric Design Principles

1. **Weighted Criteria** - Most important aspects (accuracy) weighted highest
2. **Explicit Scale** - 1-5 with clear definitions prevents ambiguity
3. **Evidence Required** - Rationale forces specific justification
4. **Actionable Thresholds** - Clear cut-offs (3.5, 2.5) for decisions
5. **Improvement Guidance** - "REVISE" verdict includes specific feedback
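
The weighting and thresholds in the rubric can be sketched in a few lines of Python. This is an illustrative helper, not part of the module's code: `weighted_total` and `recommend` are hypothetical names, and the weights and cut-offs are copied from the template above.

```python
# Weights from the rubric template: they should sum to 1.0
RUBRIC_WEIGHTS = {
    "Accuracy": 0.40,
    "Completeness": 0.30,
    "Actionability": 0.20,
    "Communication": 0.10,
}


def weighted_total(scores: dict, weights: dict = RUBRIC_WEIGHTS) -> float:
    """Combine 1-5 criterion scores into a single weighted score."""
    if set(scores) != set(weights):
        raise ValueError("scores and weights must cover the same criteria")
    for name, score in scores.items():
        if not 1 <= score <= 5:
            raise ValueError(f"{name} score must be between 1 and 5")
    return round(sum(weights[name] * scores[name] for name in weights), 2)


def recommend(total: float, accept_at: float = 3.5, revise_at: float = 2.5) -> str:
    """Map a weighted total onto the ACCEPT / REVISE / REJECT decision."""
    if total >= accept_at:
        return "ACCEPT"
    if total >= revise_at:
        return "REVISE"
    return "REJECT"


strong = {"Accuracy": 4, "Completeness": 3, "Actionability": 4, "Communication": 5}
weak = {"Accuracy": 3, "Completeness": 2, "Actionability": 3, "Communication": 4}

for label, scores in (("strong", strong), ("weak", weak)):
    total = weighted_total(scores)
    print(label, total, recommend(total))
```

Comparing with `>=` rather than listing ranges avoids gaps between bands, which matters once custom weights produce totals such as 3.45.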


### 💻 Working Example: Evaluating an AI Code Review

Let's evaluate the quality of an AI-generated code review using our judge rubric.


In [None]:
# Example: Evaluating an AI-Generated Code Review

# Step 1: Generate a code review (simulated - you could use the earlier example)
sample_code = """
def calculate_discount(price, discount_percent):
    return price - (price * discount_percent / 100)
"""

# Simulated AI review (normally this would come from get_chat_completion)
ai_generated_review = """
<review>
 <inner_monologue>
 Analyzing the discount calculation function. The logic appears straightforward but 
 I should check for edge cases: negative values, values > 100, type handling, 
 and potential precision issues with floating point arithmetic.
 </inner_monologue>
 
 <issues>
   <issue>
     <severity>major</severity>
     <description>No input validation for discount_percent</description>
     <evidence>Function accepts any numeric value. discount_percent > 100 would result in negative price.</evidence>
     <recommendation>Add validation: if not 0 <= discount_percent <= 100: raise ValueError("Discount must be between 0 and 100")</recommendation>
   </issue>
   
   <issue>
     <severity>minor</severity>
     <description>No type hints</description>
     <evidence>Parameters lack type annotations, making the expected types unclear.</evidence>
     <recommendation>Add type hints: def calculate_discount(price: float, discount_percent: float) -> float:</recommendation>
   </issue>
   
   <issue>
     <severity>nit</severity>
     <description>Missing docstring</description>
     <evidence>Function lacks documentation explaining parameters and return value.</evidence>
     <recommendation>Add docstring with parameter descriptions and example usage.</recommendation>
   </issue>
 </issues>
 
 <verdict>NEEDS REVISION</verdict>
 <summary>Function has correct core logic but lacks input validation which could lead to runtime bugs. Adding validation and type hints would make it production-ready.</summary>
</review>
"""

# Step 2: Evaluate with LLM-as-Judge
judge_messages = [
    {
        "role": "system",
        "content": "You are a Principal Engineer reviewing AI-generated code feedback."
    },
    {
        "role": "user",
        "content": f"""
<context>
Original code under review:
{sample_code}

The AI-generated review to evaluate is provided in the <submission> block below.
</context>

<rubric>
1. Accuracy (40%): Do the identified issues actually exist, and are they correctly described?
2. Completeness (30%): Are major concerns covered? Any critical issues missed?
3. Actionability (20%): Are recommendations specific and implementable?
4. Communication (10%): Is the review professional, clear, and well-structured?
</rubric>

<instructions>
Score each criterion 1-5 with detailed rationale.
Calculate weighted total: (Accuracy×0.4) + (Completeness×0.3) + (Actionability×0.2) + (Communication×0.1)
Recommend:
- ACCEPT (≥3.5): Production-ready
- REVISE (2.5-3.4): Needs improvements  
- REJECT (<2.5): Unacceptable quality
</instructions>

<submission>
{ai_generated_review}
</submission>

<output_format>
Provide structured evaluation with scores, weighted total, recommendation, and specific feedback.
</output_format>
"""
    }
]

print("⚖️ JUDGE EVALUATION IN PROGRESS...")
print("="*70)
judge_result = get_chat_completion(judge_messages, temperature=0.0)
print(judge_result)
print("="*70)


### 📊 Why LLM-as-Judge Is Powerful

**1. Automated Quality Gate**
```python
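# Sketch only: post_review_comment, regenerate_review, and notify_human_reviewer
# are placeholders for your own integrations
if weighted_score >= 3.5: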
    # Auto-approve and use the AI review
    post_review_comment(ai_generated_review)
elif weighted_score >= 2.5:
    # Trigger regeneration with feedback
    regenerate_review(with_guidance=judge_feedback)
else:
    # Fallback to human review
    notify_human_reviewer()
```

**2. Consistent Standards**
- Rubric encodes team expectations
- Same criteria applied every time
- Reduces reviewer bias

**3. Continuous Improvement**
- Low scores → Prompt refinement
- Track score trends over time
- A/B test different prompt versions
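
Tracking scores and A/B testing prompts can start very simply: log the judge's weighted total per prompt version and compare averages. The sketch below uses made-up scores, and `compare_prompt_versions` is a hypothetical helper; a real comparison would want more samples and the same inputs for each version.

```python
from statistics import mean


def compare_prompt_versions(scores_by_version: dict) -> tuple:
    """Return (best_version, averages) from logged judge scores."""
    averages = {
        version: round(mean(scores), 2)
        for version, scores in scores_by_version.items()
    }
    best = max(averages, key=averages.get)
    return best, averages


# Made-up judge totals from four runs of each prompt version
history = {
    "prompt_v1": [3.1, 3.4, 2.9, 3.6],
    "prompt_v2": [3.8, 3.5, 4.0, 3.7],
}

best, averages = compare_prompt_versions(history)
print(best, averages)
```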

**4. Transparency & Trust**
- Shows reasoning for accept/reject
- Teams can audit decisions
- Builds confidence in AI-assisted workflows

#### 🎯 Real-World Use Cases

| Application | Implementation |
|-------------|----------------|
| **CI/CD Pipeline** | Code review → Judge eval → Auto-comment if score ≥ 3.5 |
| **Test Plan Validation** | Generate tests → Judge completeness → Flag gaps |
| **Documentation Review** | AI writes docs → Judge clarity → Request revisions |
| **Prompt Engineering** | Compare prompts → Judge outputs → Pick best version |

#### 🔧 Customization Tips

**Adjust weights for your context:**
```python
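# Weights per criterion; each profile should sum to 1.0 (100%)

# Security-focused team
security_weights = {"Accuracy": 0.50, "Completeness": 0.30, "Actionability": 0.15, "Communication": 0.05}

# DevRel/Documentation team
devrel_weights = {"Communication": 0.40, "Actionability": 0.30, "Accuracy": 0.20, "Completeness": 0.10}

# Fast-moving startup
startup_weights = {"Actionability": 0.50, "Accuracy": 0.30, "Completeness": 0.15, "Communication": 0.05}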
```

**Add domain-specific criteria:**
- **Performance Review**: "Does review mention Big-O complexity?"
- **Security Review**: "Are OWASP Top 10 risks addressed?"
- **API Review**: "Are breaking changes clearly flagged?"


---

<div style="background: linear-gradient(135deg, #4facfe 0%, #00f2fe 100%); color: white; padding: 24px; border-radius: 12px; margin: 40px 0; box-shadow: 0 4px 6px rgba(0,0,0,0.1);">
  <div style="text-align: center; margin-bottom: 20px;">
    <h2 style="color: white; margin: 0; font-size: 1.8em; text-shadow: 2px 2px 4px rgba(0,0,0,0.3);">🧃 Suggested Break Point #3</h2>
    <p style="margin: 8px 0; font-size: 1.1em; text-shadow: 1px 1px 2px rgba(0,0,0,0.3);">~105 minutes elapsed • Almost there!</p>
  </div>
  
  <div style="background: rgba(0,0,0,0.25); padding: 16px; border-radius: 8px; margin: 16px 0;">
    <p style="margin: 8px 0; font-size: 1.05em; font-weight: 600; text-shadow: 1px 1px 2px rgba(0,0,0,0.3);">✅ Completed (Sections 1-3):</p>
    <ul style="margin: 8px 0; padding-left: 24px; font-size: 0.95em; text-shadow: 1px 1px 2px rgba(0,0,0,0.2);">
      <li>Code Review Template with Decomposition + CoT</li>
      <li>Test Case Generation with Coverage Analysis</li>
      <li>LLM-as-Judge Evaluation Rubric</li>
      <li>Quality gates and automated validation</li>
    </ul>
    <p style="margin: 12px 0 0 0; font-size: 0.95em; text-shadow: 1px 1px 2px rgba(0,0,0,0.2);">🎯 You've completed 3 out of 4 sections!</p>
  </div>
  
  <div style="background: rgba(0,0,0,0.25); padding: 16px; border-radius: 8px; margin: 16px 0;">
    <p style="margin: 8px 0; font-size: 1.05em; font-weight: 600; text-shadow: 1px 1px 2px rgba(0,0,0,0.3);">⏭️ Final Sprint:</p>
    <ul style="margin: 8px 0; padding-left: 24px; text-shadow: 1px 1px 2px rgba(0,0,0,0.2);">
      <li>Hands-On Practice Activities (4 exercises)</li>
      <li>Comprehensive code review across multiple dimensions</li>
      <li>Test generation for ambiguous requirements</li>
      <li>Template customization and quality evaluation</li>
    </ul>
    <p style="margin: 12px 0 0 0; font-size: 0.95em; text-shadow: 1px 1px 2px rgba(0,0,0,0.2);">⏱️ Remaining time: ~40-50 minutes</p>
  </div>
  
  <div style="background: rgba(255,255,255,0.95); padding: 14px; border-radius: 8px; margin: 16px 0; text-align: center; color: #1e293b;">
    <p style="margin: 0; font-weight: bold; font-size: 1.1em; color: #1e293b;">📌 BOOKMARK TO RESUME:</p>
    <p style="margin: 8px 0 0 0; font-size: 1.15em; font-weight: bold; color: #0f172a;">"Hands-On Practice Activities"</p>
  </div>
  
  <p style="text-align: center; margin: 16px 0 0 0; font-size: 0.9em; text-shadow: 1px 1px 2px rgba(0,0,0,0.3);">
    💡 <em>You're in the home stretch! Take a quick break before the practice exercises.</em>
  </p>
</div>

---


## 🏋️ Hands-On Practice Activities

### Activity 3.2: Comprehensive Code Review Template

**Goal:** Create a template for comprehensive code review across multiple dimensions.

**Scenario:** Your team needs automated code reviews for all API changes. Build a prompt template that evaluates:
- Security (authentication, input validation, common vulnerabilities)
- Performance (query optimization, algorithm efficiency)
- Code Quality (readability, maintainability, error handling)
- Best Practices (language idioms, design patterns)

**Your Task:**
1. Adapt the code review template with comprehensive review guidelines
2. Test it on the API endpoint code below
3. Evaluate: Did it catch issues across multiple dimensions?


In [None]:
# Activity 3.2: Security Code Review

security_code = """
+ @app.route('/api/user/<user_id>/profile', methods=['GET', 'POST'])
+ def user_profile(user_id):
+     if request.method == 'POST':
+         # Update user profile
+         data = request.get_json()
+         query = f"UPDATE users SET bio='{data['bio']}', website='{data['website']}' WHERE id={user_id}"
+         db.execute(query)
+         
+         # Store uploaded avatar
+         if 'avatar' in request.files:
+             file = request.files['avatar']
+             file.save(f'/uploads/{file.filename}')
+         
+         return jsonify({"message": "Profile updated"})
+     
+     # Get user profile
+     user = db.query(f"SELECT * FROM users WHERE id={user_id}").fetchone()
+     return jsonify(user)
"""

# TODO: Build your security review template
# Hints:
# - Role: "Senior Security Engineer" or "Application Security Specialist"
# - Guidelines: Check for SQL injection, path traversal, missing auth, XSS
# - Focus areas: Input validation, authentication, file upload security
# - Severity: Use security-specific levels (Critical/High/Medium/Low)

security_review_messages = [
    {
        "role": "system",
        "content": "You are a Senior Application Security Engineer specializing in web API security."
    },
    {
        "role": "user",
        "content": f"""
<context>
Repository: user-api-service  
Endpoint: User Profile Management (new endpoint)
Security Focus: OWASP Top 10, authentication, input validation
</context>

<code_diff>
{security_code}
</code_diff>

<review_guidelines>
1. Check for OWASP Top 10 vulnerabilities (SQL injection, XSS, broken auth, etc.)
2. Verify authentication and authorization mechanisms
3. Assess input validation and sanitization
4. Review file upload handling for path traversal
5. Check for sensitive data exposure
6. Cite exact lines with CVE/CWE references where applicable
</review_guidelines>

<tasks>
Step 1 - Think: In <inner_monologue> tags, identify security vulnerabilities.
Step 2 - Assess: For each issue, provide:
  • Severity (critical/high/medium/low)
  • Vulnerability type (SQL injection, etc.)
  • Evidence (line numbers, attack vector)
  • CVE/CWE reference if applicable
Step 3 - Suggest: Provide secure code alternatives.
Step 4 - Verdict: Security assessment (block/requires-fixes/approve-with-notes).
</tasks>

<output_format>
<security_review>
 <vulnerabilities>
   <vulnerability>
     <severity></severity>
     <type></type>
     <description></description>
     <evidence></evidence>
     <cwe_reference></cwe_reference>
     <recommendation></recommendation>
   </vulnerability>
 </vulnerabilities>
 <verdict></verdict>
 <summary></summary>
</security_review>
</output_format>
"""
    }
]

print("🔒 SECURITY REVIEW - Activity 3.2")
print("="*70)
security_result = get_chat_completion(security_review_messages, temperature=0.0)
print(security_result)
print("="*70)
print("\n💡 Expected findings:")
print("   - SQL Injection (Critical) - f-string query construction")
print("   - Path Traversal (High) - Unsafe file.filename usage")
print("   - Missing Authentication (Critical) - No auth check on endpoint")
print("   - Potential XSS (Medium) - Unvalidated user data returned")


#### ✅ Solution Analysis

**Key Best Practices Demonstrated:**

1. **Multi-Dimensional Role** - Uses "Senior Software Engineer" with broad expertise (security, performance, quality)
2. **Balanced Review Guidelines** - Covers security, performance, maintainability, and best practices
3. **Clear Categories** - Categorizes findings (Security / Performance / Quality / Correctness)
4. **Practical Severity** - Uses CRITICAL/MAJOR/MINOR based on impact across all dimensions
5. **Actionable Feedback** - Provides concrete fixes and recommendations

**Expected Findings:**
- ✅ SQL Injection - f-string query construction (Security)
- ✅ Path Traversal - Unsafe file.filename usage (Security)
- ✅ Missing Authentication - No auth decorator (Security)
- ✅ Poor Error Handling - Potential XSS in responses (Quality/Security)
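
To see what a good review should recommend for the first two findings, here is a minimal sketch of the fixes using only the standard library. It is not the module's solution: `update_profile` and `safe_avatar_name` are hypothetical helpers, an in-memory SQLite database stands in for `db`, the allowed extensions are an assumption, and authentication is left out entirely.

```python
import sqlite3
from pathlib import PurePosixPath
from uuid import uuid4

# Assumption: only these image types are accepted as avatars
ALLOWED_EXTENSIONS = {".png", ".jpg", ".jpeg"}


def update_profile(conn: sqlite3.Connection, user_id: int, bio: str, website: str) -> int:
    """Parameterized UPDATE: user input is bound as data, never spliced into SQL."""
    cursor = conn.execute(
        "UPDATE users SET bio = ?, website = ? WHERE id = ?",
        (bio, website, user_id),
    )
    conn.commit()
    return cursor.rowcount


def safe_avatar_name(client_filename: str) -> str:
    """Ignore the client-supplied name; keep only a validated extension."""
    suffix = PurePosixPath(client_filename.replace("\\", "/")).suffix.lower()
    if suffix not in ALLOWED_EXTENSIONS:
        raise ValueError(f"unsupported avatar type: {suffix!r}")
    return f"{uuid4().hex}{suffix}"


# Demo with an injection payload that would rewrite every row in the original code
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, bio TEXT, website TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?, ?)", [(1, "hi", ""), (2, "untouched", "")])

payload = "x', bio='hacked' WHERE 1=1 --"
update_profile(conn, 1, payload, "https://example.com")
print(conn.execute("SELECT bio FROM users WHERE id = 2").fetchone()[0])
print(safe_avatar_name("../../etc/passwd.png"))
```

In the Flask route itself, declaring the parameter as `<int:user_id>` would also reject non-numeric IDs before the handler runs.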

**📖 Full Solution:** See [solutions/activity-3.2-code-review-solution.md](solutions/activity-3.2-code-review-solution.md) for:
- Detailed analysis of each best practice
- Production CI/CD integration examples
- Customization patterns for different tech stacks and contexts
- Metrics for tracking template effectiveness


### Activity 3.3: Template Customization Challenge

**Goal:** Customize prompt templates for your team's specific needs.

**Scenario:** Different teams have different review standards. Adapt the base template for various contexts.

**Your Task:** Choose one and implement it:

**Option A: Performance-Focused Review**
- Role: "Senior Performance Engineer"
- Focus: Big-O complexity, caching, database query optimization, memory usage
- Test on: A function with nested loops or N+1 query problem

**Option B: DevOps/SRE Review**  
- Role: "Site Reliability Engineer"
- Focus: Observability (logging, metrics, tracing), error handling, graceful degradation
- Test on: A service initialization function

**Option C: API Design Review**
- Role: "API Architect"  
- Focus: RESTful conventions, versioning, backward compatibility, error responses
- Test on: A new API endpoint design

Pick one and build it below!


#### ✅ Solution Analysis

**Key Best Practices Demonstrated:**

1. **Domain-Specific Role** - "Performance Engineer" not generic "Engineer"
2. **Scale Context** - "Must handle 1000+ posts" sets clear performance bar
3. **Quantified Analysis** - Big-O notation, query counts, latency estimates (not vague "slow")
4. **Before/After Metrics** - Shows improvement: 100s → 0.05s (2000x faster!)
5. **Actionable Optimizations** - Provides exact code for the fix

**Expected Findings:**
- ✅ N+1 Query Problem - 2001 database queries (1 user + 1000 posts + 1000 likes)
- ✅ Complexity: O(n) queries; at roughly 50 ms of network latency per query, 2001 queries take about 100 seconds for 1000 posts
- ✅ Solution: A single join query reduces this to O(1) queries, about 0.05 seconds
- ✅ Additional opportunities: Caching, pagination, indexing
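
The query counts above can be reproduced with a small experiment. The sketch below is an approximation using an in-memory SQLite database instead of the ORM in the exercise: `CountingConnection`, `build_db`, and both `likes_*` functions are hypothetical names, and the table layout is assumed. It mirrors the loop's access pattern (one lookup for the user's posts, then two queries per post) and compares it with a single `JOIN`.

```python
import sqlite3


class CountingConnection:
    """Thin wrapper that counts how many SELECTs are issued."""

    def __init__(self):
        self.conn = sqlite3.connect(":memory:")
        self.query_count = 0

    def query(self, sql, params=()):
        self.query_count += 1
        return self.conn.execute(sql, params).fetchall()


def build_db(n_posts: int, likes_per_post: int = 2) -> CountingConnection:
    db = CountingConnection()
    db.conn.executescript("""
        CREATE TABLE posts (id INTEGER PRIMARY KEY, user_id INTEGER);
        CREATE TABLE likes (id INTEGER PRIMARY KEY, post_id INTEGER);
        CREATE INDEX idx_likes_post ON likes (post_id);
    """)
    post_ids = range(1, n_posts + 1)
    db.conn.executemany("INSERT INTO posts (id, user_id) VALUES (?, 1)", [(i,) for i in post_ids])
    db.conn.executemany(
        "INSERT INTO likes (post_id) VALUES (?)",
        [(i,) for i in post_ids for _ in range(likes_per_post)],
    )
    return db


def likes_n_plus_one(db: CountingConnection, user_id: int) -> dict:
    """Same access pattern as the exercise: 1 + n + n queries."""
    post_ids = [row[0] for row in db.query("SELECT id FROM posts WHERE user_id = ?", (user_id,))]
    result = {}
    for post_id in post_ids:
        db.query("SELECT id FROM posts WHERE id = ?", (post_id,))  # one per post
        count = db.query("SELECT COUNT(*) FROM likes WHERE post_id = ?", (post_id,))  # one per post
        result[post_id] = count[0][0]
    return result


def likes_single_query(db: CountingConnection, user_id: int) -> dict:
    """One JOIN with GROUP BY returns every post's like count at once."""
    rows = db.query(
        """SELECT p.id, COUNT(l.id) FROM posts p
           LEFT JOIN likes l ON l.post_id = p.id
           WHERE p.user_id = ? GROUP BY p.id""",
        (user_id,),
    )
    return dict(rows)


db = build_db(n_posts=1000)
slow = likes_n_plus_one(db, user_id=1)
print("N+1 queries:", db.query_count)
db.query_count = 0
fast = likes_single_query(db, user_id=1)
print("JOIN queries:", db.query_count)
```

An in-memory database has almost no per-query latency, so the timing gap here is small; the 100 s versus 0.05 s figures only appear once each round trip costs tens of milliseconds.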

**Adaptation Pattern:**

| Domain | Role | Focus | Output Metrics |
|--------|------|-------|----------------|
| **Performance** | Performance Engineer | Big-O, N+1, caching | Latency, query counts |
| **SRE** | Site Reliability Engineer | Logging, metrics, resilience | Observability gaps |
| **API Design** | API Architect | REST, versioning | Breaking changes |

**Key Takeaway:** Same template structure, different expertise area!

**📖 Full Solution:** See [solutions/activity-3.3-customization-solution.md](solutions/activity-3.3-customization-solution.md) for:
- Complete N+1 query analysis and optimized code
- Full adaptation patterns for SRE, API design, React
- When to create domain-specific templates
- Step-by-step customization strategy


In [None]:
# Activity 3.3: Template Customization

# Example: Performance-Focused Review
perf_code = """
+ def get_user_posts_with_likes(user_id):
+     user = User.query.get(user_id)
+     posts = []
+     for post_id in user.post_ids:
+         post = Post.query.get(post_id)
+         like_count = Like.query.filter_by(post_id=post.id).count()
+         post.likes = like_count
+         posts.append(post)
+     return posts
"""

# TODO: Customize for YOUR chosen focus area
# This example shows performance review - adapt for your choice!

custom_messages = [
    {
        "role": "system",
        "content": "You are a Senior Performance Engineer specializing in database optimization."
    },
    {
        "role": "user",
        "content": f"""
<context>
Repository: social-media-api
Function: get_user_posts_with_likes
Performance Requirements: Must handle users with 1000+ posts efficiently
</context>

<code_diff>
{perf_code}
</code_diff>

<review_guidelines>
1. Analyze algorithmic complexity (Big-O notation)
2. Identify N+1 query problems
3. Check for caching opportunities
4. Assess memory usage patterns
5. Recommend performance optimizations
6. Estimate performance impact with data size
</review_guidelines>

<tasks>
Step 1 - Think: In <inner_monologue>, analyze time/space complexity and identify bottlenecks.
Step 2 - Assess: For each issue:
  • Severity (critical/high/medium/low based on performance impact)
  • Complexity analysis (O(n), O(n²), etc.)
  • Evidence (specific operations causing slowdown)
  • Performance impact estimate
Step 3 - Suggest: Provide optimized code with complexity improvement.
Step 4 - Verdict: Performance rating and estimated improvement.
</tasks>

<output_format>
<performance_review>
 <complexity_analysis></complexity_analysis>
 <issues>
   <issue>
     <severity></severity>
     <problem></problem>
     <current_complexity></current_complexity>
     <evidence></evidence>
     <optimization></optimization>
     <improved_complexity></improved_complexity>
   </issue>
 </issues>
 <verdict></verdict>
 <summary></summary>
</performance_review>
</output_format>
"""
    }
]

print("⚡ CUSTOM TEMPLATE - Activity 3.3 (Performance Review)")
print("="*70)
custom_result = get_chat_completion(custom_messages, temperature=0.0)
print(custom_result)
print("="*70)
print("\n💡 Adaptation tips:")
print("   - Changed role to match domain")
print("   - Added domain-specific guidelines (Big-O, N+1)")
print("   - Modified severity to reflect performance impact")
print("   - Customized output format for complexity analysis")
print("\n   Try adapting this for SRE or API Design review!")


### Activity 3.4: Quality Evaluation with LLM-as-Judge

**Goal:** Build an automated quality gate for AI-generated outputs.

**Scenario:** You're implementing automated code reviews in your CI/CD pipeline. Before posting AI reviews to PRs, you want quality assurance.

**Your Task:**
1. Generate a code review (use any previous example or create a new one)
2. Create an LLM-as-Judge rubric for your team's standards
3. Evaluate the review and decide: Accept / Revise / Reject
4. Reflection: Would this catch low-quality AI outputs?


#### ✅ Solution Analysis

**Key Best Practices Demonstrated:**

1. **Intentionally Low-Quality Input** - Tests judge with vague review ("Make it better")
2. **Context-Specific Rubric** - Weights match CI/CD needs (Specificity 40%, Actionability 30%)
3. **Clear Scoring Scale** - 1-5 with explicit definitions (not subjective "good/bad")
4. **Actionable Thresholds** - ≥4.0 = accept, 2.5-3.9 = revise, <2.5 = reject
5. **Improvement Loop** - Provides specific feedback for regeneration

**Expected Judge Scores:**

```
Specificity: 1/5 ❌ - "Function could be improved" is vague
Actionability: 1/5 ❌ - "Make it better" is not actionable
Technical Accuracy: 2/5 ⚠️ - Can't verify without specifics
Completeness: 2/5 ⚠️ - Only 1 issue found, likely incomplete

Weighted Total: (1×0.4) + (1×0.3) + (2×0.2) + (2×0.1) = 1.3
Decision: REJECT (<2.5) 🚫
```

**Production Workflow:**

```python
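# Sketch only: post_to_pr, regenerate_with_feedback, and flag_for_human_review
# are placeholders for your own integrations
if score >= 4.0: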
    post_to_pr(review)  # Auto-approve
elif score >= 2.5:
    regenerate_with_feedback()  # Retry with guidance
else:
    flag_for_human_review()  # Fallback
```
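
The workflow above leaves open how many times to regenerate. One way to bound it is a small retry loop, sketched below under stated assumptions: `quality_gate` is a hypothetical helper, `generate` and `judge` are callables you would back with real model calls, and the stubs here return canned values purely for illustration.

```python
from typing import Callable, Optional, Tuple


def quality_gate(
    generate: Callable[[Optional[str]], str],
    judge: Callable[[str], Tuple[float, str]],
    max_attempts: int = 3,
    accept_at: float = 4.0,
    revise_at: float = 2.5,
) -> dict:
    """Regenerate with judge feedback until accepted, rejected, or out of attempts."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    feedback = None
    for attempt in range(1, max_attempts + 1):
        review = generate(feedback)
        score, feedback = judge(review)
        if score >= accept_at:
            return {"status": "posted", "review": review, "attempts": attempt}
        if score < revise_at:
            break  # REJECT: stop retrying and hand over to a person
    return {"status": "human_review", "review": None, "attempts": attempt}


# Stubs standing in for real model calls (canned behavior, illustration only)
def fake_generate(feedback: Optional[str]) -> str:
    if feedback is None:
        return "Function could be improved."
    return "Line 2: discount_percent is not validated; raise ValueError outside 0-100."


def fake_judge(review: str) -> Tuple[float, str]:
    if "Line" in review:
        return 4.3, ""
    return 3.0, "Cite line numbers and give a concrete fix."


result = quality_gate(fake_generate, fake_judge)
print(result["status"], result["attempts"])
```

Capping attempts keeps cost predictable, and anything that is rejected outright or never reaches the accept threshold falls back to a human reviewer.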

**Why This Matters:**
- Prevents vague AI outputs from reaching users
- Builds trust through consistent quality
- Enables true automation (not just "AI suggestion")

**📖 Full Solution:** See [solutions/activity-3.4-judge-solution.md](solutions/activity-3.4-judge-solution.md) for:
- Complete judge evaluation breakdown
- Production quality gate implementation with retry logic
- Monitoring dashboard examples
- Success metrics to track (acceptance rate, cost per review)


In [None]:
# Activity 3.4: Build Your Own Judge

# Sample AI-generated output to evaluate (simulated)
ai_output_to_judge = """
<review>
 <issues>
   <issue>
     <severity>medium</severity>
     <description>Function could be improved</description>
     <evidence>The code is not optimal</evidence>
     <recommendation>Make it better</recommendation>
   </issue>
 </issues>
 <verdict>NEEDS WORK</verdict>
 <summary>Some issues found</summary>
</review>
"""

# TODO: This is a LOW QUALITY review (vague, no specifics)
# Build a judge that catches this!

judge_eval_messages = [
    {
        "role": "system",
        "content": "You are a Principal Engineer evaluating AI-generated code reviews for your team's CI/CD pipeline."
    },
    {
        "role": "user",
        "content": f"""
<context>
Your team is implementing automated code reviews. Reviews must meet high standards before being posted to PRs.
</context>

<rubric>
1. Specificity (40%): Are issues concrete with exact evidence (line numbers, code snippets)?
2. Actionability (30%): Can developer immediately act on recommendations?
3. Technical Accuracy (20%): Are the issues technically sound?
4. Completeness (10%): Are major categories covered (security, performance, correctness)?
</rubric>

<instructions>
Score each criterion 1-5:
- 5: Excellent - Ready for production
- 4: Good - Minor improvements needed
- 3: Acceptable - Meets minimum bar
- 2: Poor - Significant issues, needs revision
- 1: Unacceptable - Reject and regenerate

Calculate weighted score.
Provide specific feedback for scores < 4.

Decision thresholds:
- ACCEPT (≥4.0): Post to PR
- REVISE (2.5 to <4.0): Regenerate with specific guidance
- REJECT (<2.5): Discard, use different approach
</instructions>

<submission>
{ai_output_to_judge}
</submission>

<output_format>
<evaluation>
 <scores>
   <criterion name="Specificity" weight="40%">
     <score></score>
     <rationale></rationale>
     <improvement_needed></improvement_needed>
   </criterion>
   <criterion name="Actionability" weight="30%">
     <score></score>
     <rationale></rationale>
     <improvement_needed></improvement_needed>
   </criterion>
   <criterion name="Technical Accuracy" weight="20%">
     <score></score>
     <rationale></rationale>
     <improvement_needed></improvement_needed>
   </criterion>
   <criterion name="Completeness" weight="10%">
     <score></score>
     <rationale></rationale>
     <improvement_needed></improvement_needed>
   </criterion>
 </scores>
 <weighted_score></weighted_score>
 <decision>ACCEPT/REVISE/REJECT</decision>
 <feedback>Specific guidance for improvement</feedback>
</evaluation>
</output_format>
"""
    }
]

print("⚖️ QUALITY EVALUATION - Activity 3.4")
print("="*70)
print("Evaluating this AI-generated review:")
print(ai_output_to_judge)
print("\n" + "="*70)
judge_eval_result = get_chat_completion(judge_eval_messages, temperature=0.0)
print(judge_eval_result)
print("="*70)
print("\n💡 This review is intentionally vague. Your judge should:")
print("   - Give low scores for Specificity (no line numbers)")
print("   - Give low scores for Actionability ('make it better' is useless)")
print("   - Recommend REVISE or REJECT")
print("\n   If your judge caught these issues, it's working! ✅")
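
To act on the judge's reply in a pipeline, the XML has to be turned into numbers. The sketch below shows one way to do that with the standard library, assuming the reply follows the `<evaluation>` format requested in this activity and is well-formed XML (an unescaped `<` or `&` inside a rationale would make parsing fail). The function name `parse_judge_output` and the sample reply are made up for illustration.

```python
import xml.etree.ElementTree as ET


def parse_judge_output(raw):
    """Extract per-criterion scores, a recomputed total, and the decision."""
    start = raw.find("<evaluation>")
    end = raw.find("</evaluation>")
    if start == -1 or end == -1:
        raise ValueError("No <evaluation> block found in judge output")
    root = ET.fromstring(raw[start:end + len("</evaluation>")])

    scores = {}
    weights = {}
    for criterion in root.iter("criterion"):
        name = criterion.get("name", "")
        score_text = (criterion.findtext("score") or "").strip()
        weight_text = criterion.get("weight", "").rstrip("%")
        if not (name and score_text and weight_text):
            raise ValueError(f"Incomplete criterion in judge output: {name!r}")
        scores[name] = float(score_text)
        weights[name] = float(weight_text) / 100

    # Recompute the total locally instead of trusting the model's arithmetic.
    total = round(sum(scores[n] * weights[n] for n in scores), 2)
    decision = (root.findtext("decision") or "").strip()
    return scores, total, decision


# Made-up judge reply, used only to exercise the parser.
sample_reply = """Here is my evaluation:
<evaluation>
 <scores>
   <criterion name="Specificity" weight="40%"><score>1</score></criterion>
   <criterion name="Actionability" weight="30%"><score>1</score></criterion>
   <criterion name="Technical Accuracy" weight="20%"><score>2</score></criterion>
   <criterion name="Completeness" weight="10%"><score>2</score></criterion>
 </scores>
 <weighted_score>1.3</weighted_score>
 <decision>REJECT</decision>
</evaluation>"""

scores, total, decision = parse_judge_output(sample_reply)
print(scores, total, decision)
```

Recomputing the weighted total from the individual scores is a deliberate choice here: it guards against a judge that scores sensibly but adds incorrectly.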


### 📦 Reference Implementation: Production-Ready Template Library

Below is a copy-paste-ready reference implementation that pulls together the practices from this module. It relies on the `get_chat_completion` helper defined earlier in this notebook.


In [None]:
# Production-Ready Template Library
# Copy this to your project and customize for your team

from typing import Dict, List, Optional
from dataclasses import dataclass

# ============================================
# 1. TEMPLATE DEFINITIONS
# ============================================

class ReviewTemplate:
    """Base template for code reviews with parameterization"""
    
    @staticmethod
    def code_review_template(
        tech_stack: str = "Python microservices",
        repo_name: str = "{{repo_name}}",
        service_name: str = "{{service_name}}",
        code_diff: str = "{{code_diff}}"
    ) -> List[Dict[str, str]]:
        """
        Production-ready code review template.
        
        Args:
            tech_stack: Technology focus (e.g., "Python microservices", "React frontend")
            repo_name: Repository name for context
            service_name: Service/component name
            code_diff: Git diff to review
        
        Returns:
            Messages array ready for AI completion
        """
        return [
            {
                "role": "system",
                "content": f"You are a Senior Backend Engineer specializing in {tech_stack}."
            },
            {
                "role": "user",
                "content": f"""
<context>
Repository: {repo_name}
Service: {service_name}
</context>

<code_diff>
{code_diff}
</code_diff>

<review_guidelines>
1. Highlight issues affecting correctness, security, performance, and maintainability.
2. Cite exact lines or blocks.
3. If code is acceptable, confirm with justification.
</review_guidelines>

<tasks>
Step 1 - Think: In <inner_monologue> tags, outline potential issues.
Step 2 - Assess: For each issue, provide severity, description, evidence.
Step 3 - Suggest: Offer actionable remediation tips.
Step 4 - Verdict: Conclude with pass/fail and summary.
</tasks>

<output_format>
<review>
 <inner_monologue>...</inner_monologue>
 <issues>
   <issue>
     <severity></severity>
     <description></description>
     <evidence></evidence>
     <recommendation></recommendation>
   </issue>
 </issues>
 <verdict></verdict>
 <summary></summary>
</review>
</output_format>
"""
            }
        ]

    @staticmethod
    def security_review_template(
        repo_name: str = "{{repo_name}}",
        code_diff: str = "{{code_diff}}"
    ) -> List[Dict[str, str]]:
        """Security-focused review template"""
        return [
            {
                "role": "system",
                "content": "You are a Senior Application Security Engineer specializing in OWASP Top 10 vulnerabilities."
            },
            {
                "role": "user",
                "content": f"""
<context>
Repository: {repo_name}
Security Focus: OWASP Top 10, authentication, input validation
</context>

<code_diff>
{code_diff}
</code_diff>

<review_guidelines>
1. Check for OWASP Top 10 vulnerabilities
2. Verify authentication and authorization
3. Assess input validation and sanitization
4. Check for sensitive data exposure
5. Cite CVE/CWE references where applicable
</review_guidelines>

<tasks>
Step 1 - Think: In <inner_monologue>, identify security vulnerabilities.
Step 2 - Assess: For each issue, provide severity, type, evidence, CWE reference.
Step 3 - Suggest: Provide secure code alternatives.
Step 4 - Verdict: Security assessment (block/requires-fixes/approve-with-notes).
</tasks>

<output_format>
<security_review>
 <vulnerabilities>
   <vulnerability>
     <severity>critical/high/medium/low</severity>
     <type>vulnerability type</type>
     <description></description>
     <evidence></evidence>
     <cwe_reference></cwe_reference>
     <recommendation></recommendation>
   </vulnerability>
 </vulnerabilities>
 <verdict></verdict>
 <summary></summary>
</security_review>
</output_format>
"""
            }
        ]

    @staticmethod
    def test_generation_template(
        tech_stack: str = "Python/pytest",
        requirements: str = "{{requirements}}",
        existing_tests: str = "{{existing_tests}}"
    ) -> List[Dict[str, str]]:
        """Test case generation template"""
        return [
            {
                "role": "system",
                "content": f"You are a QA Automation Lead with expertise in {tech_stack}."
            },
            {
                "role": "user",
                "content": f"""
<requirements>
{requirements}
</requirements>

<existing_tests>
{existing_tests}
</existing_tests>

<tasks>
1. Analyze requirements and identify ambiguities.
2. List coverage gaps in existing tests.
3. Generate test cases: happy paths, edge cases, error paths, business rules.
4. Separate unit tests from integration tests.
5. Flag missing test data or dependencies.
</tasks>

<reasoning>
Provide analysis in <analysis></analysis> tags.
</reasoning>

<output_format>
<test_plan>
 <ambiguities></ambiguities>
 <coverage_gap></coverage_gap>
 <unit_tests>
   <test>
     <name></name>
     <purpose></purpose>
     <preconditions></preconditions>
     <steps></steps>
     <expected></expected>
   </test>
 </unit_tests>
 <integration_tests>...</integration_tests>
 <test_data_needed></test_data_needed>
</test_plan>
</output_format>
"""
            }
        ]

    @staticmethod
    def judge_template(
        submission: str,
        criteria_weights: Optional[Dict[str, float]] = None
    ) -> List[Dict[str, str]]:
        """LLM-as-Judge evaluation template"""
        
        if criteria_weights is None:
            criteria_weights = {
                "Accuracy": 0.40,
                "Completeness": 0.30,
                "Actionability": 0.20,
                "Communication": 0.10
            }
        
        # round() avoids float truncation (int(0.29 * 100) would give 28, not 29)
        weights_str = "\n".join([
            f"{i+1}. {name} ({round(weight * 100)}%)" 
            for i, (name, weight) in enumerate(criteria_weights.items())
        ])
        
        return [
            {
                "role": "system",
                "content": "You are a Principal Engineer evaluating AI-generated outputs for quality."
            },
            {
                "role": "user",
                "content": f"""
<rubric>
{weights_str}
</rubric>

<instructions>
Score each criterion 1-5 with rationale.
Calculate weighted total.
Recommend: ACCEPT (≥3.5), REVISE (2.5 to <3.5), REJECT (<2.5)
</instructions>

<submission>
{submission}
</submission>

<output_format>
<evaluation>
 <scores>
   <criterion name="">
     <score></score>
     <rationale></rationale>
   </criterion>
 </scores>
 <weighted_total></weighted_total>
 <recommendation>ACCEPT/REVISE/REJECT</recommendation>
 <feedback></feedback>
</evaluation>
</output_format>
"""
            }
        ]


# ============================================
# 2. WORKFLOW AUTOMATION
# ============================================

@dataclass
class ReviewResult:
    """Structured review result"""
    content: str
    score: Optional[float] = None
    verdict: Optional[str] = None
    passed_quality_gate: bool = False


def automated_review_workflow(
    code_diff: str,
    repo_name: str,
    quality_threshold: float = 3.5,
    max_retries: int = 2
) -> ReviewResult:
    """
    Complete automated review workflow with quality gate.
    
    This demonstrates best practices:
    - Template parameterization
    - LLM-as-Judge validation
    - Retry logic (passing judge feedback into the retry is left as an extension)
    - Structured output
    
    Args:
        code_diff: Git diff to review
        repo_name: Repository name for context
        quality_threshold: Minimum score to accept (default 3.5)
        max_retries: Maximum regeneration attempts
    
    Returns:
        ReviewResult with content and quality metrics
    """
    
    for attempt in range(max_retries + 1):
        try:
            # Step 1: Generate review
            review_messages = ReviewTemplate.code_review_template(
                repo_name=repo_name,
                code_diff=code_diff
            )
            
            review_content = get_chat_completion(review_messages, temperature=0.0)
            if not review_content:
                raise ValueError("Review generation returned empty result")
            
            # Step 2: Evaluate with judge
            judge_messages = ReviewTemplate.judge_template(submission=review_content)
            judge_result = get_chat_completion(judge_messages, temperature=0.0)
            if not judge_result:
                raise ValueError("Judge evaluation returned empty result")
            
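            # Step 3: Parse the weighted total from the judge's XML output.
            # A regex keeps this cell dependency-free; use a real XML parser in production.
            import re
            match = re.search(r"<weighted_total>\s*([0-9]+(?:\.[0-9]+)?)", judge_result)
            if not match:
                raise ValueError("Could not parse <weighted_total> from judge output")
            score = float(match.group(1))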
            
            # Step 4: Decision
            if score >= quality_threshold:
                return ReviewResult(
                    content=review_content,
                    score=score,
                    passed_quality_gate=True
                )
            elif attempt < max_retries:
                print(f"⚠️ Quality score {score} below threshold. Retry {attempt+1}/{max_retries}")
                continue
            else:
                return ReviewResult(
                    content=review_content,
                    score=score,
                    passed_quality_gate=False
                )
                
        except Exception as e:
            print(f"❌ Error on attempt {attempt+1}: {e}")
            if attempt == max_retries:
                return ReviewResult(
                    content=f"Error: {e}",
                    passed_quality_gate=False
                )
    
    return ReviewResult(content="Max retries exceeded", passed_quality_gate=False)


# ============================================
# 3. EXAMPLE USAGE
# ============================================

print("📦 Production-Ready Template Library Loaded!")
print("\n✅ Available templates:")
print("   - ReviewTemplate.code_review_template()")
print("   - ReviewTemplate.security_review_template()")
print("   - ReviewTemplate.test_generation_template()")
print("   - ReviewTemplate.judge_template()")
print("\n✅ Workflow automation:")
print("   - automated_review_workflow()")
print("\n💡 Copy this cell to your project and customize!")
print("\n📝 Usage example:")
print("""
# Basic usage
messages = ReviewTemplate.code_review_template(
    tech_stack="React frontend",
    repo_name="my-app",
    code_diff=my_diff
)
result = get_chat_completion(messages)

# With quality gate
result = automated_review_workflow(
    code_diff=my_diff,
    repo_name="my-app",
    quality_threshold=4.0
)
if result.passed_quality_gate:
    post_to_pr(result.content)
""")
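
`automated_review_workflow` retries by sending the same prompt again, and at `temperature=0.0` a retry is likely to resemble the first attempt closely. One way to make retries more useful is to feed the judge's feedback back into the conversation. The sketch below assumes the chat client accepts the common system/user/assistant message format; the helper name `add_judge_feedback` and the `<judge_feedback>` tag are hypothetical.

```python
from typing import Dict, List


def add_judge_feedback(
    review_messages: List[Dict[str, str]],
    previous_review: str,
    judge_feedback: str,
) -> List[Dict[str, str]]:
    """Return a new messages list that asks the model to revise its last review."""
    return review_messages + [
        # Show the model its own previous attempt...
        {"role": "assistant", "content": previous_review},
        # ...followed by the judge's critique and a request to revise.
        {
            "role": "user",
            "content": (
                "<judge_feedback>\n"
                f"{judge_feedback}\n"
                "</judge_feedback>\n\n"
                "Revise your review to address this feedback. "
                "Keep the same output format."
            ),
        },
    ]


# Example with stand-in messages and feedback.
base = [
    {"role": "system", "content": "You are a Senior Backend Engineer."},
    {"role": "user", "content": "<code_diff>...</code_diff>"},
]
retry_messages = add_judge_feedback(
    base,
    previous_review="<review>...</review>",
    judge_feedback="Cite exact line numbers for every issue.",
)
print(len(retry_messages))
```

The original list is left unmodified, so each retry can start from the same base prompt.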


---

<div style="background: linear-gradient(135deg, #fa709a 0%, #fee140 100%); color: white; padding: 24px; border-radius: 12px; margin: 40px 0; box-shadow: 0 4px 6px rgba(0,0,0,0.1);">
  <div style="text-align: center; margin-bottom: 20px;">
    <h2 style="color: white; margin: 0; font-size: 1.8em; text-shadow: 2px 2px 4px rgba(0,0,0,0.3);">🎯 Suggested Break Point #4</h2>
    <p style="margin: 8px 0; font-size: 1.1em; text-shadow: 1px 1px 2px rgba(0,0,0,0.3);">~145 minutes elapsed • Final section!</p>
  </div>
  
  <div style="background: rgba(0,0,0,0.25); padding: 16px; border-radius: 8px; margin: 16px 0;">
    <p style="margin: 8px 0; font-size: 1.05em; font-weight: 600; text-shadow: 1px 1px 2px rgba(0,0,0,0.3);">✅ Completed:</p>
    <ul style="margin: 8px 0; padding-left: 24px; font-size: 0.95em; text-shadow: 1px 1px 2px rgba(0,0,0,0.2);">
      <li>Section 1: Code Review Automation Template</li>
      <li>Section 2: Test Case Generation</li>
      <li>Section 3: LLM-as-Judge Evaluation</li>
      <li>All Hands-On Practice Activities (4 exercises)</li>
      <li>Production-Ready Template Library</li>
    </ul>
    <p style="margin: 12px 0 0 0; font-size: 0.95em; text-shadow: 1px 1px 2px rgba(0,0,0,0.2);">🎉 You've completed all core sections and exercises!</p>
  </div>
  
  <div style="background: rgba(0,0,0,0.25); padding: 16px; border-radius: 8px; margin: 16px 0;">
    <p style="margin: 8px 0; font-size: 1.05em; font-weight: 600; text-shadow: 1px 1px 2px rgba(0,0,0,0.3);">⏭️ Final Topics:</p>
    <ul style="margin: 8px 0; padding-left: 24px; text-shadow: 1px 1px 2px rgba(0,0,0,0.2);">
      <li>Section 4: Template Best Practices & Quality Checklist</li>
      <li>Version control and maintenance strategies</li>
      <li>Production deployment guidelines</li>
      <li>CI/CD and automation integration patterns</li>
    </ul>
    <p style="margin: 12px 0 0 0; font-size: 0.95em; text-shadow: 1px 1px 2px rgba(0,0,0,0.2);">⏱️ Remaining time: ~5-10 minutes (reading)</p>
  </div>
  
  <div style="background: rgba(255,255,255,0.95); padding: 14px; border-radius: 8px; margin: 16px 0; text-align: center; color: #1e293b;">
    <p style="margin: 0; font-weight: bold; font-size: 1.1em; color: #1e293b;">📌 BOOKMARK TO RESUME:</p>
    <p style="margin: 8px 0 0 0; font-size: 1.15em; font-weight: bold; color: #0f172a;">"Section 4: Template Best Practices"</p>
  </div>
  
  <p style="text-align: center; margin: 16px 0 0 0; font-size: 0.9em; text-shadow: 1px 1px 2px rgba(0,0,0,0.3);">
    💡 <em>Nearly done! The final section covers deployment best practices and is mostly reading.</em>
  </p>
</div>

---


## 📊 Section 4: Template Best Practices & Quality Checklist

### Quality Checklist Before Deployment

Before using a prompt template in production, validate it meets these standards:

#### ✅ Role & Context
- [ ] **Role description** matches task scope and domain expertise
- [ ] **Expertise level** is appropriate (Junior/Senior/Principal)
- [ ] **Domain specification** is clear (Backend/Frontend/Security/Performance)
- [ ] **Context** includes necessary background (repo, service, requirements)

#### ✅ Instructions & Structure
- [ ] **Tasks decomposed** into explicit, numbered steps
- [ ] **Required outputs** are clearly specified
- [ ] **XML/structured tags** used for organization (`<context>`, `<tasks>`, etc.)
- [ ] **Examples provided** where format is ambiguous

#### ✅ Reasoning & Transparency
- [ ] **Chain-of-thought** requested for complex analysis
- [ ] **Inner monologue** tagged if reasoning should be separated from output
- [ ] **Evidence required** for all claims (line numbers, specific quotes)
- [ ] **Rationale requested** for subjective decisions

#### ✅ Output Format
- [ ] **Structured format** defined (XML, JSON, or clear template)
- [ ] **Severity/priority** levels standardized across team
- [ ] **Output is parseable** by automation tools if needed
- [ ] **Format examples** provided in prompt or documentation

#### ✅ Evaluation & Quality
- [ ] **LLM-as-Judge rubric** defined with weighted criteria
- [ ] **Acceptance thresholds** established (e.g., score ≥ 3.5)
- [ ] **Failure modes** identified with fallback strategies
- [ ] **Quality metrics** tracked over time

#### ✅ Parameterization & Reuse
- [ ] **Variables identified** and marked with `{{placeholders}}`
- [ ] **Template documented** with parameter descriptions
- [ ] **Usage examples** provided for team members
- [ ] **Default values** specified where appropriate

#### ✅ Testing & Validation
- [ ] **Tested on multiple scenarios** (happy path, edge cases, errors)
- [ ] **Peer reviewed** by subject matter experts
- [ ] **Failure cases** tested (what happens with bad input?)
- [ ] **Performance measured** (latency, token usage, cost)

#### ✅ Team Alignment
- [ ] **Standards match** team conventions (severity labels, output format)
- [ ] **Language/tone** appropriate for team culture
- [ ] **Integration points** defined (CI/CD, IDE, chat tools)
- [ ] **Feedback mechanism** established for continuous improvement
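
The parameterization items in this checklist can be enforced mechanically. A template stored as plain text with `{{placeholders}}` will happily be sent to the model with a placeholder still in it if a caller forgets a parameter. The sketch below fills placeholders and fails loudly when one is missing; `render_template` and the `PLACEHOLDER` pattern are illustrative names, and the `{{name}}` syntax is assumed to allow only word characters.

```python
import re

# Matches {{name}} with optional whitespace inside the braces.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render_template(template: str, **params: str) -> str:
    """Fill {{name}} placeholders; raise if any parameter is missing."""
    missing = sorted(set(PLACEHOLDER.findall(template)) - set(params))
    if missing:
        raise KeyError(f"Missing template parameters: {', '.join(missing)}")
    # A function replacement avoids backslash surprises when a diff is substituted.
    return PLACEHOLDER.sub(lambda m: str(params[m.group(1)]), template)


prompt = render_template(
    "Repository: {{repo_name}}\nService: {{service_name}}",
    repo_name="my-app",
    service_name="billing",
)
print(prompt)
```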


### 📝 Template Versioning & Maintenance

**Treat prompts like code** - version control and track changes!

#### Version Control Structure
```
prompts/
├── code-review/
│   ├── v1.0-baseline.md
│   ├── v1.1-added-security-focus.md
│   ├── v2.0-restructured-output.md
│   └── CHANGELOG.md
├── test-generation/
│   ├── v1.0-baseline.md
│   └── CHANGELOG.md
└── llm-as-judge/
    ├── code-review-judge-v1.0.md
    └── CHANGELOG.md
```
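
With a layout like this, tooling often needs to pick the newest template in a directory. The sketch below selects it by the `vMAJOR.MINOR` tag in the filename. It assumes every versioned file carries such a tag somewhere in its name, which holds for the filenames in the tree above; `latest_template` is an illustrative name.

```python
import re
from typing import Iterable, Optional

VERSION = re.compile(r"v(\d+)\.(\d+)")


def latest_template(filenames: Iterable[str]) -> Optional[str]:
    """Return the filename with the highest vMAJOR.MINOR tag, or None."""
    versioned = []
    for name in filenames:
        match = VERSION.search(name)
        if match:  # skips unversioned files such as CHANGELOG.md
            version = (int(match.group(1)), int(match.group(2)))
            versioned.append((version, name))
    return max(versioned)[1] if versioned else None


files = [
    "v1.0-baseline.md",
    "v1.1-added-security-focus.md",
    "v2.0-restructured-output.md",
    "CHANGELOG.md",
]
print(latest_template(files))
```

Comparing versions as integer tuples rather than strings matters once a minor version reaches double digits: `v1.10` sorts after `v1.9`.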

#### CHANGELOG Example
```markdown
## Code Review Template - Changelog

### v2.0 (2024-03-15)
**Breaking Changes:**
- Changed output format from plain text to XML
- Renamed severity levels: blocker→critical, nit→trivial

**Improvements:**
- Added <inner_monologue> for reasoning transparency
- Increased evidence requirement (must cite line numbers)
- Added performance impact estimation

**Metrics:**
- LLM-as-Judge avg score: 4.2 → 4.6
- False positive rate: 12% → 8%
- User satisfaction: 3.8 → 4.3

### v1.1 (2024-02-20)
**Improvements:**
- Added security-specific guidelines (OWASP Top 10)
- Increased token limit to handle larger diffs

**Metrics:**
- Caught 15% more security issues in testing
```

#### When to Version Bump
- **Major (v1 → v2)**: Breaking changes to output format, role changes
- **Minor (v1.0 → v1.1)**: Added capabilities, new guidelines
- **Patch (v1.1.0 → v1.1.1)**: Bug fixes, clarity improvements
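
The bump rules above can be captured in a small helper so that version numbers are not edited by hand. This sketch assumes three-part `MAJOR.MINOR.PATCH` strings throughout (the list above mixes two-part and three-part forms); `bump_version` is an illustrative name.

```python
def bump_version(version: str, change: str) -> str:
    """Bump a 'MAJOR.MINOR.PATCH' string; change is 'major', 'minor', or 'patch'."""
    major, minor, patch = (int(part) for part in version.split("."))
    if change == "major":  # breaking output-format or role changes
        return f"{major + 1}.0.0"
    if change == "minor":  # added capabilities or guidelines
        return f"{major}.{minor + 1}.0"
    if change == "patch":  # bug fixes, clarity improvements
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"Unknown change type: {change}")


print(bump_version("1.1.0", "patch"))
```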

#### A/B Testing Prompts
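```python
from statistics import mean

# run_reviews_with_template and judge are placeholders for your own helpers;
# judge(review) is assumed to return a numeric score.

# Compare two prompt versions on the same set of test PRs
results_v1 = run_reviews_with_template("code-review-v1.0.md", test_prs)
results_v2 = run_reviews_with_template("code-review-v2.0.md", test_prs)

# Evaluate with LLM-as-Judge
scores_v1 = [judge(r) for r in results_v1]
scores_v2 = [judge(r) for r in results_v2]

print(f"v1.0 avg score: {mean(scores_v1):.1f}")  # e.g., 3.8
print(f"v2.0 avg score: {mean(scores_v2):.1f}")  # e.g., 4.3
# Deploy v2.0 if the improvement holds across the test set
```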
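
Two averages alone can hide a lot: a higher mean may come from a handful of PRs while most reviews got worse. Since both versions are run on the same PRs, a paired comparison is a cheap extra check. The sketch below is one possible summary, not a statistical test; `compare_versions` is an illustrative name and the scores are made up.

```python
from statistics import mean, stdev


def compare_versions(scores_a, scores_b):
    """Summarize a paired A/B run; scores_a[i] and scores_b[i] judge the same PR."""
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("Need two equally sized lists with at least 2 scores each")
    diffs = [b - a for a, b in zip(scores_a, scores_b)]
    return {
        "mean_a": round(mean(scores_a), 2),
        "mean_b": round(mean(scores_b), 2),
        "mean_diff": round(mean(diffs), 2),
        "stdev_diff": round(stdev(diffs), 2),
        # Share of PRs where version B scored strictly higher than version A.
        "b_win_rate": round(sum(d > 0 for d in diffs) / len(diffs), 2),
    }


# Made-up judge scores for five PRs, for illustration only.
summary = compare_versions(
    [3.0, 4.0, 4.0, 3.0, 4.0],
    [4.0, 4.0, 5.0, 4.0, 4.0],
)
print(summary)
```

A large `stdev_diff` relative to `mean_diff`, or a win rate near 0.5, suggests the test set is too small to justify switching versions.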


### 🚀 Extension & Automation Ideas

Ready to take it further? Here are real-world integration patterns:

#### 1. CI/CD Pipeline Integration
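```yaml
# .github/workflows/ai-code-review.yml
name: AI Code Review

on: pull_request

permissions:
  contents: read
  pull-requests: write  # needed to post the review comment

jobs:
  ai-review:
    runs-on: ubuntu-latest
    env:
      GH_TOKEN: ${{ github.token }}  # required by the gh CLI
    steps:
      - uses: actions/checkout@v4  # makes scripts/ and prompts/ available

      - name: Get PR diff
        run: gh pr diff ${{ github.event.pull_request.number }} > diff.txt
      
      - name: Run AI Review
        run: python scripts/ai_review.py --diff diff.txt --template prompts/code-review-v2.0.md
      
      - name: Evaluate with Judge
        id: judge  # lets later steps read steps.judge.outputs.score
        # judge_review.py must write "score=<value>" to $GITHUB_OUTPUT
        run: python scripts/judge_review.py --review review.json
      
      - name: Post if High Quality (score ≥ 4.0)
        if: steps.judge.outputs.score >= 4.0
        run: gh pr comment ${{ github.event.pull_request.number }} --body-file review.md
```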
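
The workflow's `if:` condition reads `steps.judge.outputs.score`, which only works when the judge step has `id: judge` and the script publishes a `score` output. GitHub Actions reads step outputs from the file named by the `GITHUB_OUTPUT` environment variable, one `name=value` pair per line. The contents of `scripts/judge_review.py` are not shown in this module, so the helper below is a hypothetical sketch of the publishing part only.

```python
import os


def write_step_output(name: str, value: object, output_path: str = "") -> None:
    """Append `name=value` to the file GitHub Actions reads step outputs from."""
    path = output_path or os.environ.get("GITHUB_OUTPUT", "")
    if not path:
        raise RuntimeError("GITHUB_OUTPUT is not set; not running inside GitHub Actions?")
    with open(path, "a", encoding="utf-8") as handle:
        handle.write(f"{name}={value}\n")


# Inside judge_review.py, after computing the score (hypothetical usage):
# write_step_output("score", 4.2)
```

The optional `output_path` argument exists only so the function can be exercised locally, outside a runner.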

#### 2. IDE Integration (VS Code Extension)
```javascript
// AI Review on Save
vscode.workspace.onDidSaveTextDocument((doc) => {
  const diff = getDiff(doc);
  const template = loadTemplate('code-review-v2.0.md');
  const review = callAI(template, diff);
  const score = judgeReview(review);
  
  if (score >= 3.5) {
    showInlineComments(review);
  }
});
```

#### 3. Slack Bot Integration
```python
@slack_app.command("/ai-review")
def review_command(ack, body, say):
    ack()  # Acknowledge right away; Slack expects a response within 3 seconds
    pr_url = body['text']
    diff = github.get_pr_diff(pr_url)
    
    review = generate_review(diff, template='code-review-v2.0.md')
    score = judge_review(review)
    
    if score >= 4.0:
        say(f"✅ AI Review (score: {score}/5.0):\n{review}")
    else:
        say(f"⚠️ Low confidence review (score: {score}/5.0). Human review recommended.")
```

#### 4. Pre-Commit Hook
```bash
#!/bin/bash
# .git/hooks/pre-commit

# Get staged changes
git diff --cached > /tmp/staged.diff

# Run AI review
python scripts/quick_review.py /tmp/staged.diff

# Ask for confirmation if issues found
if [ $? -ne 0 ]; then
    # Git hooks don't run with stdin attached to the terminal, so read from /dev/tty
    read -p "AI found issues. Continue? (y/n) " -n 1 -r < /dev/tty
    echo
    if [[ ! $REPLY =~ ^[Yy]$ ]]; then
        exit 1
    fi
fi
```
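
The hook treats a non-zero exit status from `scripts/quick_review.py` as "issues found". That script is not shown in this module; the sketch below illustrates how it might derive the exit code from a review in the `<review>` XML format used earlier. It assumes the model's output is well-formed XML and that severities use labels such as `critical`/`high`/`medium`/`low`; the `BLOCKING` set and the function name are assumptions.

```python
import xml.etree.ElementTree as ET

# Severities that should interrupt the commit (an assumption; tune for your team).
BLOCKING = {"critical", "high"}


def exit_code_for(review_xml: str) -> int:
    """Return 1 if the review contains any blocking-severity issue, else 0."""
    root = ET.fromstring(review_xml)
    severities = [
        (issue.findtext("severity") or "").strip().lower()
        for issue in root.iter("issue")
    ]
    return 1 if any(s in BLOCKING for s in severities) else 0


# Made-up review, for illustration only.
sample_review = """<review>
 <issues>
   <issue><severity>High</severity><description>SQL built by string concat</description></issue>
   <issue><severity>low</severity><description>Unused import</description></issue>
 </issues>
 <verdict>fail</verdict>
</review>"""

print(exit_code_for(sample_review))
```

In the script itself, the result would be passed to `sys.exit(...)` so the shell hook can test `$?`.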

#### 5. Test Generation in Sprint Planning
```python
def generate_test_plan(feature_spec: str) -> TestPlan:
    """Generate test plan during sprint planning"""
    
    # Generate tests
    test_plan = generate_tests(
        requirements=feature_spec,
        existing_tests=get_current_suite(),
        template='test-generation-v1.0.md'
    )
    
    # Validate coverage
    judge_result = evaluate_test_plan(test_plan)
    
    if judge_result.score < 3.5:
        # Regenerate with feedback
        test_plan = generate_tests(
            requirements=feature_spec,
            existing_tests=get_current_suite(),
            template='test-generation-v1.0.md',
            previous_feedback=judge_result.feedback
        )
    
    return test_plan

# Use in planning (feature_spec holds the story's specification text):
test_plan = generate_test_plan(feature_spec)
story_points = estimate_from_test_count(test_plan.total_tests)
```

#### 6. Continuous Monitoring Dashboard
```python
# Track prompt performance over time
dashboard = {
    "code_review_v2.0": {
        "avg_judge_score": 4.3,
        "usage_count": 1247,
        "acceptance_rate": 0.89,
        "avg_latency_ms": 3200,
        "cost_per_review": 0.04
    },
    "test_generation_v1.0": {
        "avg_judge_score": 3.9,
        "usage_count": 543,
        "acceptance_rate": 0.76,
        "avg_latency_ms": 4100,
        "cost_per_plan": 0.08
    }
}
```
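
The dashboard values above have to come from somewhere. A minimal sketch of the aggregation step follows, assuming each review is logged as a dictionary with `judge_score`, `latency_ms`, and `cost_usd` keys (hypothetical field names) and that "acceptance rate" means the share of reviews at or above the judge threshold.

```python
from statistics import mean


def summarize(records, threshold=3.5):
    """Aggregate per-review log records into dashboard metrics."""
    if not records:
        raise ValueError("No records to summarize")
    accepted = sum(r["judge_score"] >= threshold for r in records)
    return {
        "avg_judge_score": round(mean(r["judge_score"] for r in records), 2),
        "usage_count": len(records),
        "acceptance_rate": round(accepted / len(records), 2),
        "avg_latency_ms": round(mean(r["latency_ms"] for r in records)),
        "cost_per_review": round(mean(r["cost_usd"] for r in records), 4),
    }


# Made-up log records, for illustration only.
log = [
    {"judge_score": 4.0, "latency_ms": 3000, "cost_usd": 0.04},
    {"judge_score": 3.0, "latency_ms": 3400, "cost_usd": 0.06},
]
print(summarize(log))
```

If your team defines acceptance as "a human kept the review", log that signal separately rather than deriving it from the judge score.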

#### 🎯 Start Small, Scale Gradually
1. **Week 1**: Use templates manually in code reviews
2. **Week 2**: Add LLM-as-Judge validation
3. **Week 3**: Integrate into one repo's CI/CD
4. **Month 2**: Expand to team repos, collect metrics
5. **Month 3**: Optimize based on feedback, version templates


---

## 📈 Track Your Progress

### Self-Assessment Questions

After completing Module 3, reflect on these questions:

1. **Can I design code review prompts with task decomposition?**
   - Do you understand how to break reviews into steps (analyze → assess → suggest → verdict)?
   - Can you create role prompts that match domain expertise?

2. **Can I create test generation templates that identify coverage gaps?**
   - Can you design prompts that compare requirements vs existing tests?
   - Do you know how to structure test specifications (purpose, preconditions, steps, expected)?

3. **Can I build LLM-as-Judge rubrics with weighted criteria?**
   - Can you define evaluation criteria appropriate for your domain?
   - Do you know how to set acceptance thresholds and provide feedback?

4. **Can I parameterize templates for reuse?**
   - Do you know how to identify and mark template variables (`{{placeholder}}`)?
   - Can you document template parameters for team use?

5. **Can I refine templates based on feedback?**
   - Do you understand version control for prompts?
   - Can you A/B test different prompt versions?

6. **Do I understand how to prepare prompts for CI/CD integration?**
   - Can you design structured outputs that tools can parse?
   - Do you know how to chain prompts (generate → judge → act)?

### ✅ Check Off Your Learning Objectives

Review the module objectives and check what you've mastered:

- [ ] **Implement SDLC-focused prompts** for code review, test generation, and documentation
- [ ] **Design reusable templates** with parameterized sections for specific workflows
- [ ] **Evaluate prompt effectiveness** using LLM-as-Judge rubrics
- [ ] **Refine and adapt templates** based on feedback and edge cases
- [ ] **Apply best practices** for version control, parameterization, and quality assurance

<div style="margin-top:16px; color:#065f46; padding:12px; background:#d1fae5; border-radius:6px; border-left:4px solid #10b981;">
<strong>💡 Self-Check:</strong> <br><br>
If you can confidently check off 4+ objectives, you're ready to apply these techniques in production!<br>
If not, revisit the sections where you feel less confident and try the practice activities again.
</div>


## 🎊 Module 3 Complete!

<div style="background: linear-gradient(135deg, #8b5cf6 0%, #6d28d9 100%); padding: 32px; border-radius: 12px; color: white; text-align: center; margin: 32px 0;">
  <h2 style="margin: 0 0 16px 0; color: white;">🎉 Congratulations!</h2>
  <p style="font-size: 1.2em; margin: 0; opacity: 0.95;">
    You've mastered production-ready prompt engineering for SDLC tasks!
  </p>
</div>

---

### What You've Accomplished

- ✅ **Applied prompt engineering tactics** to real SDLC scenarios
- ✅ **Built code review templates** with decomposition and chain-of-thought
- ✅ **Created test generation workflows** that identify coverage gaps
- ✅ **Implemented LLM-as-Judge** for quality assurance
- ✅ **Designed reusable templates** with parameterization
- ✅ **Learned best practices** for production deployment

### 🔑 Key Takeaways

**1. Combine Tactics Strategically**
- Real-world prompts use multiple tactics together
- **Role + Structure + CoT + Judge = Robust workflow**
- Each tactic amplifies the others

**2. Templates Enable Scale**
- Parameterized templates reduce prompt drift
- Version control ensures consistency over time
- Team collaboration becomes possible and repeatable
- Documentation turns templates into shared assets

**3. Quality Assurance Matters**
- LLM-as-Judge catches issues early, before they reach production
- Rubrics encode team standards in executable form
- Iterative refinement improves quality over time
- Metrics provide objective feedback loops

**4. Prepare for Production**
- Test templates thoroughly on diverse scenarios
- Document parameters and usage clearly
- Monitor performance (latency, cost, quality scores)
- Iterate based on real-world feedback

---

### 📚 What's Next?

**Apply What You've Learned:**

1. **Create templates for your team** 
   - Start with code review or test generation
   - Adapt examples from this module to your domain
   - Share with 2-3 teammates for feedback

2. **Integrate into your workflow**
   - Begin with manual use in daily work
   - Add to CI/CD when templates are stable
   - Consider IDE extensions or Slack bots

3. **Collect feedback and iterate**
   - Track what works and what doesn't
   - Use LLM-as-Judge for objective metrics
   - Version templates as they improve

4. **Share with your team**
   - Build a template library in your repo
   - Document usage patterns and best practices
   - Create a feedback channel for continuous improvement

**Continue Learning:**

- **Module 4**: Integration - Connect prompts to your development workflow (CI/CD, IDE, APIs)
- **Advanced Topics**: Multi-agent systems, prompt optimization, cost/latency tradeoffs
- **Community**: Share your templates and learn from others

---

<div style="margin-top:24px; color:#1e3a8a; padding:16px; background:#dbeafe; border-radius:6px; border-left:4px solid #3b82f6;">
<strong>🚀 Ready for Real-World Impact:</strong> <br><br>
You now have the skills to design production-ready prompt engineering workflows for software development. The templates you've learned aren't just exercises—they're patterns used by engineering teams at scale.<br><br>
Go build something amazing! 🎯
</div>

---

### 🙏 Thank You!

Thank you for completing Module 3! Your journey from learning individual tactics to building complete workflows demonstrates real growth in prompt engineering expertise.

**Questions or feedback?** Open an issue in the repository or reach out to the maintainers. We'd love to hear how you're applying these techniques!

**Next:** [Continue to Module 4: Integration](../module-04-integration/README.md)
