From d9a23e92173ceff93f21329c0914bd2fbb237641 Mon Sep 17 00:00:00 2001 From: snehangshuk Date: Tue, 14 Oct 2025 09:58:17 +0530 Subject: [PATCH 1/6] Update README to reflect revised total time for course completion, increasing from ~90 minutes to ~300 minutes (~4-5 hours) for better accuracy in user expectations. --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index e848c20..a9f3f76 100644 --- a/README.md +++ b/README.md @@ -169,7 +169,7 @@ Each module notebook has **two sections** for tracking progress: - βœ… **Prompt Engineering Toolkit** with reusable patterns and commands - βœ… **Production-Ready Workflows** for code quality, debugging, and API integration -**Total Time**: ~90 minutes (can be split into 3Γ—30min sessions) +**Total Time**: ~300 minutes (~4-5 hours) --- From 9488243f9b8b24f5bb7a7e8c132858bde2df9eca Mon Sep 17 00:00:00 2001 From: snehangshuk Date: Tue, 14 Oct 2025 10:19:04 +0530 Subject: [PATCH 2/6] Revise README and Module 2 notebook to enhance clarity on Module 3 objectives, updating descriptions for code review, debugging, and refactoring techniques. Improved next steps section to reflect new learning outcomes and best practices for software development workflows. --- 01-course/module-02-fundamentals/module2.ipynb | 9 +++++---- README.md | 4 ++-- 2 files changed, 7 insertions(+), 6 deletions(-) diff --git a/01-course/module-02-fundamentals/module2.ipynb b/01-course/module-02-fundamentals/module2.ipynb index 224738e..339a0b2 100644 --- a/01-course/module-02-fundamentals/module2.ipynb +++ b/01-course/module-02-fundamentals/module2.ipynb @@ -3103,10 +3103,11 @@ "### Next Steps\n", "\n", "Continue to **Module 3: Advanced Software Engineering Applications** where you'll learn:\n", - "- Building prompts for complex refactoring scenarios\n", - "- Creating systematic testing and QA workflows\n", - "- Designing effective debugging and performance optimization prompts\n", - "- Developing API integration and documentation helpers\n" + "- Implement prompts for code review, debugging, documentation, and refactoring\n", + "- Design reusable prompt templates for software engineering workflows\n", + "- Evaluate prompt effectiveness and output quality\n", + "- Refine templates based on feedback and edge cases\n", + "- Apply best practices for SDLC integration\n" ] } ], diff --git a/README.md b/README.md index a9f3f76..8989e58 100644 --- a/README.md +++ b/README.md @@ -149,8 +149,8 @@ Each module notebook has **two sections** for tracking progress: ### 1. 
**Interactive Course** - Learn the fundamentals - **[Module 1: Foundations](./01-course/module-01-foundations/)** - Interactive notebook (`.ipynb`) with environment setup & prompt anatomy (20 min) - **[Module 2: Core Techniques](./01-course/module-02-fundamentals/)** - Interactive notebook (`.ipynb`) with role prompting, structured inputs, few-shot examples, chain-of-thought reasoning, reference citations, prompt chaining, and evaluation techniques (90-120 min) -- **[Module 3: Applications](./01-course/module-03-applications/)** - Interactive notebook (`.ipynb`) with code quality, testing, debugging (30 min) -- **[Module 4: Integration](./01-course/module-04-integration/)** - Interactive notebook (`.ipynb`) with custom commands & AI assistants (10 min) +- **[Module 3: Applications](./01-course/module-03-applications/)** - Interactive notebook (`.ipynb`) with reusable prompt templates for code review, debugging, refactoring, and SDLC workflows (60 min) +- **[Module 4: Integration](./01-course/module-04-integration/)** - Interactive notebook (`.ipynb`) with custom commands & AI assistants (30 min) ### 2. **Practice** - Reinforce learning - **Hands-on Exercises** - Integrated into each module to reinforce concepts From 3ff2a317515a0b33dec455db33655e660a6e711d Mon Sep 17 00:00:00 2001 From: snehangshuk Date: Fri, 17 Oct 2025 17:49:17 +0530 Subject: [PATCH 3/6] Add Module 3: Applications with setup, code review, and test generation notebooks - Introduced `3.1-setup-and-introduction.ipynb` for environment setup and prompt engineering overview. - Added `3.2-code-review-automation.ipynb` to create reusable code review templates using multiple prompting tactics. - Created `setup_utils.py` for shared setup functions and utilities across notebooks. - Developed README for Module 3 outlining objectives, prerequisites, and structured learning paths. - Included `requirements.txt` for necessary dependencies and `activities` directory for hands-on practice. - Added solution files for Activity 3.1 to demonstrate a comprehensive code review template. 
--- .../3.1-setup-and-introduction.ipynb | 318 ++++++++++ .../3.2-code-review-automation.ipynb | 595 ++++++++++++++++++ 01-course/module-03-applications/README.md | 56 ++ .../activities/activity-3.1-code-review.md | 298 +++++++++ .../module-03-applications/requirements.txt | 11 + .../module-03-applications/setup_utils.py | 504 +++++++++++++++ .../activity-3.1-code-review-solution.md | 349 ++++++++++ 7 files changed, 2131 insertions(+) create mode 100644 01-course/module-03-applications/3.1-setup-and-introduction.ipynb create mode 100644 01-course/module-03-applications/3.2-code-review-automation.ipynb create mode 100644 01-course/module-03-applications/README.md create mode 100644 01-course/module-03-applications/activities/activity-3.1-code-review.md create mode 100644 01-course/module-03-applications/requirements.txt create mode 100644 01-course/module-03-applications/setup_utils.py create mode 100644 01-course/module-03-applications/solutions/activity-3.1-code-review-solution.md diff --git a/01-course/module-03-applications/3.1-setup-and-introduction.ipynb b/01-course/module-03-applications/3.1-setup-and-introduction.ipynb new file mode 100644 index 0000000..a6f8855 --- /dev/null +++ b/01-course/module-03-applications/3.1-setup-and-introduction.ipynb @@ -0,0 +1,318 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Section 3.1: Setup & Introduction\n", + "\n", + "| **Aspect** | **Details** |\n", + "|-------------|-------------|\n", + "| **Goal** | Set up your environment and understand how to apply prompt engineering to SDLC tasks |\n", + "| **Time** | ~15-20 minutes |\n", + "| **Prerequisites** | Module 2 completion, Python 3.8+, IDE with notebook support, API access (GitHub Copilot, CircuIT, or OpenAI) |\n", + "| **Next Steps** | Continue to Section 3.2: Code Review Automation |\n", + "\n", + "---\n", + "\n", + "## πŸš€ Ready to Start?\n", + "\n", + "
\n", + "⚠️ Important:

\n", + "This module builds directly on Module 2 techniques. Make sure you've completed Module 2 before starting.
\n", + "
\n", + "\n", + "## πŸ“š Module 3 Overview\n", + "\n", + "This module has 3 sections (~2 hours total):\n", + "\n", + "1. **Setup & Introduction** (this notebook) β€” 15 minutes \n", + " Get your environment ready\n", + "\n", + "2. **Code Review Automation** β€” 40 minutes \n", + " Build a template that reviews code for security, performance, and quality issues\n", + "\n", + "3. **Test Generation Automation** β€” 35 minutes \n", + " Create prompts that automatically generate unit tests\n", + "\n", + "4. **Test Generation Automation** β€” 35 minutes \n", + " Create prompts that automatically generate unit tests" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## πŸ”§ Setup: Environment Configuration\n", + "\n", + "### Step 1: Install Required Dependencies\n", + "\n", + "Let's start by installing the packages we need for this tutorial.\n", + "\n", + "Run the cell below. You should see a success message when installation completes:\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Install required packages for Module 3\n", + "import subprocess\n", + "import sys\n", + "\n", + "def install_requirements():\n", + " try:\n", + " # Install from requirements.txt\n", + " subprocess.check_call([sys.executable, \"-m\", \"pip\", \"install\", \"-q\", \"-r\", \"requirements.txt\"])\n", + " print(\"βœ… SUCCESS! Module 3 dependencies installed successfully.\")\n", + " print(\"πŸ“¦ Ready for: openai, anthropic, python-dotenv, requests\")\n", + " except subprocess.CalledProcessError as e:\n", + " print(f\"❌ Installation failed: {e}\")\n", + " print(\"πŸ’‘ Try running: pip install openai anthropic python-dotenv requests\")\n", + "\n", + "install_requirements()\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Step 2: Load Setup Utilities\n", + "\n", + "
\n", + "\n", + "πŸ’‘ New Approach:

\n", + "We've extracted all setup code into setup_utils.py - a reusable module! This means:\n", + "
    \n", + "
  • βœ… Setup runs once, works everywhere
  • \n", + "
  • βœ… Sections 3.2, 3.3, and 3.4 just import and go
  • \n", + "
  • βœ… Includes helper functions for testing activities
  • \n", + "
\n", + "
\n", + "\n", + "
\n", + "πŸ’‘ Note:

\n", + "The code below runs on your local machine and connects to AI services over the internet.\n", + "
\n", + "\n", + "**Configure your AI provider:**\n", + "- **Option A: GitHub Copilot API (local proxy)** ⭐ **Recommended**\n", + " - Supports both Claude and OpenAI models\n", + " - No API keys needed\n", + " - Follow [GitHub-Copilot-2-API/README.md](../../GitHub-Copilot-2-API/README.md)\n", + " \n", + "- **Option B/C:** Edit `setup_utils.py` if using OpenAI API or CircuIT directly\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Load the setup utilities module\n", + "from setup_utils import *\n", + "\n", + "print(\"βœ… Setup utilities loaded successfully!\")\n", + "print(f\"πŸ€– Provider: {PROVIDER.upper()}\")\n", + "print(f\"πŸ“ Default model: {get_default_model()}\")\n", + "print()\n", + "\n", + "# Test the connection\n", + "print(\"πŸ§ͺ Testing connection...\")\n", + "if test_connection():\n", + " print()\n", + " print(\"=\"*70)\n", + " print(\"πŸŽ‰ Setup complete! You're ready to continue.\")\n", + " print(\"=\"*70)\n", + "else:\n", + " print()\n", + " print(\"⚠️ Connection test failed. Please check:\")\n", + " print(\" 1. Is GitHub Copilot proxy running on port 7711?\")\n", + " print(\" 2. Did you follow the setup instructions?\")\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "
\n", + "\n", + "πŸ’‘ What's Available?

\n", + "The setup_utils.py module provides these functions:

\n", + "Core Functions:\n", + "
    \n", + "
  • get_chat_completion(messages) - Send prompts to AI
  • \n", + "
  • get_default_model() - Get current model name
  • \n", + "
  • test_connection() - Test AI connection
  • \n", + "
\n", + "\n", + "Activity Testing Functions:\n", + "
    \n", + "
  • test_activity(file, code, variables) - Test any activity template
  • \n", + "
  • test_activity_3_1(code, variables) - Quick test for Activity 3.1
  • \n", + "
  • test_activity_3_2(code, variables) - Quick test for Activity 3.2
  • \n", + "
  • list_activities() - Show available activities
  • \n", + "
\n", + "\n", + "These will be used in Parts 2 and 3!\n", + "
\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 🎯 Applying Prompt Engineering to SDLC Tasks\n", + "\n", + "---\n", + "\n", + "### Introduction: From Tactics to Real-World Applications\n", + "\n", + "#### πŸš€ Ready to Transform Your Development Workflow?\n", + "\n", + "You've successfully mastered the core tactics in Module 2. Now comes the exciting part - **applying these techniques to real-world software engineering challenges** that you face every day.\n", + "\n", + "Think of what you've accomplished so far as **learning individual martial arts moves**. Now we're going to **choreograph them into powerful combinations** that solve actual development problems.\n", + "\n", + "#### πŸ‘¨β€πŸ’» What You're About to Master\n", + "\n", + "In the next sections, you'll discover **how to combine tactics strategically** to build production-ready prompts for critical SDLC tasks:\n", + "\n", + "
\n", + "\n", + "
\n", + "πŸ” Code Review Automation
\n", + "Comprehensive review prompts with structured feedback\n", + "
\n", + "\n", + "
\n", + "πŸ§ͺ Test Generation Automation
\n", + "Smart test plans with coverage gap analysis\n", + "
\n", + "\n", + "
\n", + "βš–οΈ Quality Validation
\n", + "LLM-as-Judge rubrics for output verification\n", + "
\n", + "\n", + "
\n", + "πŸ“‹ Reusable Templates
\n", + "Parameterized prompts for CI/CD integration\n", + "
\n", + "\n", + "
\n", + "\n", + "
\n", + "πŸ’‘ Pro Tip:

\n", + "This module covers practical applications over 90 minutes across 3 parts. Take short breaks between parts to reflect on how each template applies to your projects. Make notes as you progress. The key skill is learning which tactic combinations solve which problems!\n", + "
\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### 🎨 Technique Spotlight: Strategic Combinations\n", + "\n", + "Here's how Module 2 tactics combine to solve real SDLC challenges:\n", + "\n", + "| **Technique** | **Purpose in SDLC Context** | **Prompting Tip** |\n", + "|---------------|----------------------------|-------------------|\n", + "| **Task Decomposition** | Break multifaceted engineering tasks into manageable parts | Structure prompt into numbered steps or XML blocks |\n", + "| **Role Prompting** | Align the model's persona with engineering expectations | Specify domain, experience level, and evaluation criteria |\n", + "| **Chain-of-Thought** | Ensure reasoning is visible, aiding traceability and auditing | Request structured reasoning before conclusions |\n", + "| **LLM-as-Judge** | Evaluate code changes or generated artifacts against standards | Provide rubric with weighted criteria and evidence requirement |\n", + "| **Few-Shot Examples** | Instill preferred review tone, severity labels, or test formats | Include short exemplars with both input and expected reasoning |\n", + "| **Prompt Templates** | Reduce prompt drift across teams and tools | Parameterize sections (`{{code_diff}}`, `{{requirements}}`) for reuse |\n", + "\n", + "#### πŸ”— The Power of Strategic Combinations\n", + "\n", + "The real skill isn't using tactics in isolationβ€”it's knowing **which combinations solve which problems**. Each section demonstrates a different combination pattern optimized for specific SDLC challenges.\n", + "\n", + "Ready to build production-ready solutions? Let's dive in! πŸ‘‡\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## βœ… Setup Complete!\n", + "\n", + "
\n", + "πŸŽ‰ You're All Set!

\n", + "\n", + "Your environment is configured and ready. Here's what you have:\n", + "\n", + "βœ… **AI Connection** - Tested and working
\n", + "βœ… **Setup Utilities** - Loaded and available
\n", + "βœ… **Activity Helpers** - Ready for hands-on practice
\n", + "βœ… **Understanding** - You know what's coming next
\n", + "
\n", + "\n", + "### ⏭️ Next Steps\n", + "\n", + "**Continue to Section 2:**\n", + "1. Open [`3.2-code-review-automation.ipynb`](./3.2-code-review-automation.ipynb)\n", + "2. The setup will already be loaded - just import from `setup_utils`!\n", + "3. Learn how to build production-ready code review templates\n", + "4. Complete Activity 3.1 in your own `.md` file\n", + "\n", + "**πŸ’‘ Tip:** Keep this notebook open in case you need to troubleshoot the connection later.\n", + "\n", + "---\n", + "\n", + "### πŸ”— Quick Links\n", + "\n", + "- **Next:** [Section 2: Code Review Automation](./3.2-code-review-automation.ipynb)\n", + "- **Activities:** [Browse Activities](./activities/README.md)\n", + "- **Solutions:** [View Solutions](./solutions/README.md)\n", + "- **Main README:** [Module 3 Overview](./README.md)\n", + "\n", + "---\n", + "\n", + "**Ready to continue?** [πŸš€ Open Section 2 now](./3.2-code-review-automation.ipynb)!\n" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": ".venv", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.13.2" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/01-course/module-03-applications/3.2-code-review-automation.ipynb b/01-course/module-03-applications/3.2-code-review-automation.ipynb new file mode 100644 index 0000000..de8016a --- /dev/null +++ b/01-course/module-03-applications/3.2-code-review-automation.ipynb @@ -0,0 +1,595 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Section 3.2: Code Review Automation\n", + "\n", + "| **Aspect** | **Details** |\n", + "|-------------|-------------|\n", + "| **Goal** | Build production-ready code review templates that combine multiple prompt engineering tactics |\n", + "| **Time** | ~40 minutes |\n", + "| **Prerequisites** | Section 1 complete, setup_utils.py loaded |\n", + "| **What You'll Learn** | Strategic tactic combinations, template parameterization, comprehensive code reviews |\n", + "| **Next Steps** | Continue to Section 3.3: Test Case Automation |\n", + "---\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## πŸ”§ Quick Setup Check\n", + "\n", + "Since you completed Section 1, setup is already done! We just need to import it.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Quick setup check - imports setup_utils\n", + "try:\n", + " import importlib\n", + " import setup_utils\n", + " importlib.reload(setup_utils)\n", + " from setup_utils import *\n", + " print(f\"βœ… Setup loaded! Using {PROVIDER.upper()} with {get_default_model()}\")\n", + " print(\"πŸš€ Ready to build code review templates!\")\n", + "except ImportError:\n", + " print(\"❌ Setup not found!\")\n", + " print(\"πŸ’‘ Please run 3.1-setup-and-introduction.ipynb first to set up your environment.\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## πŸ” Code Review Automation Template\n", + "\n", + "### Building a Comprehensive Code Review Prompt with a Multi-Tactic Stack\n", + "\n", + "
\n", + "🎯 What You'll Build in This Section

\n", + "\n", + "You'll create a **code review prompt template** that automatically checks code like an experienced engineer would. The prompt template will assist to find bugs, security issues, and quality problems, and provide clear suggestions on how to fix them.\n", + "\n", + "**Time Required:** ~40 minutes (includes learning, examples, and hands-on activity)\n", + "
\n", + "\n", + "Layering tactics is the key to getting that level of rigor. Each block in the template leans on a different Module 2 technique so the model moves from context β†’ reasoning β†’ decision without dropping details. We'll call out those tactical touchpoints as you work through the section.\n", + "\n", + "#### 🎯 The Problem We're Solving\n", + "\n", + "Manual code reviews face three critical challenges:\n", + "\n", + "1. **⏰ Time Bottlenecks** \n", + " - Senior engineers spend 8-12 hours/week reviewing PRs\n", + " - Review queues delay feature delivery by 2-3 days on average\n", + " - **Impact:** Slower velocity, frustrated developers\n", + "\n", + "2. **🎯 Inconsistent Standards**\n", + " - Different reviewers prioritize different concerns\n", + " - New team members lack institutional knowledge\n", + " - Review quality varies based on reviewer fatigue\n", + " - **Impact:** Technical debt accumulates, security gaps emerge\n", + "\n", + "3. **πŸ“ Lost Knowledge**\n", + " - Review reasoning buried in PR comments\n", + " - No searchable audit trail for security decisions\n", + " - Hard to train junior developers on review standards\n", + " - **Impact:** Repeated mistakes, difficult compliance auditing\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### πŸ—οΈ How We'll Build It: The Tactical Combination\n", + "\n", + "We assemble this template by chaining together five Module 2 tactics. The table recaps what each tactic contributes and the callouts below map them to concrete sections of the prompt.\n", + "\n", + "| **Tactic** | **Purpose in This Template** | **Why Modern LLMs Need This** |\n", + "|------------|------------------------------|-------------------------------|\n", + "| **Role Prompting** | Establishes \"Senior Backend Engineer\" perspective with specific expertise | LLMs respond better when given explicit expertise context rather than assuming generic knowledge |\n", + "| **Structured Inputs (XML)** | Separates code, context, and guidelines into clear sections | Prevents models from mixing different information types during analysis |\n", + "| **Task Decomposition** | Breaks review into 4 sequential steps (Think β†’ Assess β†’ Suggest β†’ Verdict) | Advanced LLMs excel at following explicit numbered steps rather than implicit workflows |\n", + "| **Chain-of-Thought** | Makes reasoning visible in Analysis section | Improves accuracy by forcing deliberate analysis before conclusions |\n", + "| **Structured Output** | Uses readable markdown format with severity levels | Enables human readability while maintaining parseable structure for automation |\n", + "\n", + "\n", + "
\n", + "Choosing XML vs Markdown for Prompting LLMs

\n", + "The effectiveness of the format can change depending on the AI model. It depends on the complexity and length of the prompt structure, but any notation the model can accurately understand is fine, and maintainability on the human side is also important.\n", + "

Pick the structure that keeps instructions crystal clear:

\n", + "
    \n", + "
  • Match the model:\n", + "
      \n", + "
    • Claude is tuned for XML.
    • \n", + "
    • GPT-4/5 works with XML or Markdown; experiment both with your workflow.
    • \n", + "
    • Llama-class or other open models usually prefer XML on complex prompts.
    • \n", + "
    \n", + "
  • \n", + "
  • Match the prompt:\n", + "
      \n", + "
    • Short prompts can stay in Markdown.
    • \n", + "
    • Multi-section prompts gain clarity from XML because each block (role, context, examples) is explicitly tagged.
    • \n", + "
    \n", + "
  • Match the stakes:\n", + "
      \n", + "
    • Markdown saves tokens for lightweight tasks.
    • \n", + "
    • When accuracy matters more than cost, XML’s structure often pays off.
    • \n", + "
    \n", + "
  • Label everything: Whatever format you pick, clearly separate context, instructions, and examplesβ€”use descriptive tags in XML or consistent headings in Markdown.
  • \n", + "
\n", + "
\n", + "
\n", + "
\n", + "\n", + "πŸš€ Let's Build It!

\n", + "\n", + "In the next cell, you'll see the complete template structure. **Pay special attention to**:\n", + "- How we use explicit language to define severity levels (not \"bad code\" but \"allows SQL injection\")\n", + "- Why the markdown output format is more readable than XML while still being parseable\n", + "- How parameters like `{{tech_stack}}` and `{{change_purpose}}` make the template reusable across projects\n", + "- How the 6 review dimensions (Security, Performance, Error Handling, etc.) ensure comprehensive analysis\n", + "\n", + "After reviewing the template, you'll test it on real code and see how each tactic contributes to the result.\n", + "
\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### πŸ€” Why Templatize Prompts?\n", + "\n", + "Templating turns good prompting habits into a repeatable system. Instead of rewriting long instructions, you:\n", + "\n", + "- Swap in new repos, services, or code diffs with `{{variables}}`\n", + "- Guarantee every review covers the same dimensions and severity language\n", + "- Reduce drift as teammates inherit a proven prompt rather than inventing their own\n", + "- Make automation easy because the structure is predictable\n", + "\n", + "Want a deeper dive? Claude's guidance on prompt templates and variables breaks down when to parameterise and how to organise reusable snippets: [Prompt templates & variables](https://docs.claude.com/en/docs/build-with-claude/prompt-engineering/prompt-templates-and-variables).\n", + "\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### πŸ“‹ Template Structure\n", + "\n", + "Break the template into focused pieces so the model never has to parse a wall of XML. Each block is paired with the tactic that keeps it reliable.\n", + "\n", + "\n", + "**Where the tactics show up in the template:**\n", + "\n", + "| Template Block | What It Does | Tactic Used |\n", + "| --- | --- | --- |\n", + "| **1. ``** | Sets the reviewer persona (e.g., Senior Python Engineer) | Role Prompting |\n", + "| **2. ``** | Provides repository, service, and change purpose | Structured Inputs |\n", + "| **3. ``** | Contains the code changes to review | Structured Inputs |\n", + "| **4. ``** | Lists what to check (security, performance, quality, etc.) | Task Decomposition |\n", + "| **5. ``** | Guides the model: Think β†’ Assess β†’ Suggest β†’ Verdict | Task Decomposition + Chain-of-Thought |\n", + "| **6. ``** | Defines the structure: Summary β†’ Findings table β†’ Verdict | Structured Output |\n", + "\n", + "You can reuse this template for different projects by swapping variables like `{{repo_name}}`, `{{change_purpose}}`, or `{{tech_stack}}`. The review tactics and quality checks remain consistent across all uses.\n", + "\n", + "````xml\n", + "\n", + "Act as a Senior Software Engineer specializing in {{tech_stack}} backend services.\n", + "\n", + "\n", + "\n", + "Repository: {{repo_name}}\n", + "Service: {{service_name}}\n", + "Change Purpose: {{change_purpose}}\n", + "Language: {{lang}}\n", + "\n", + "\n", + "\n", + "{{code_diff}}\n", + "\n", + "\n", + "\n", + "Assess the change across:\n", + "1. Security (auth, data handling, injection)\n", + "2. Reliability and correctness\n", + "3. Performance and resource usage\n", + "4. Maintainability and readability\n", + "5. Observability and logging\n", + "\n", + "\n", + "\n", + "1. Think through the change and note risks.\n", + "2. Analyse the code against the review guidelines.\n", + "3. Suggest fixes with concrete recommendations.\n", + "4. 
Deliver a final verdict (approve, needs work, block).\n", + "\n", + "\n", + "\n", + "## Summary\n", + "[One paragraph that captures overall review stance.]\n", + "\n", + "## Findings\n", + "### [SEVERITY] Issue Title\n", + "**Category:** [Security / Performance / Quality / Correctness / Best Practices]\n", + "**Line:** [line number]\n", + "**Issue:** [impact-focused description]\n", + "**Recommendation:**\n", + "```{{lang}}\n", + "# safer / faster / cleaner fix here\n", + "```\n", + "\n", + "## Verdict\n", + "- Decision: [Approve / Needs Changes / Block]\n", + "- Rationale: [Why you chose this verdict]\n", + "\n", + "````\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### πŸ’» Working Example: Comprehensive Review Walkthrough\n", + "\n", + "Now let's see the template in action! We'll review a realistic code change: a monthly report exporter that touches database queries, caching, and S3 uploads.\n", + "\n", + "**What to look for:**\n", + "- Each section of the prompt is marked with comments like ``\n", + "- Match each block back to the structure table above\n", + "- Notice how the 6 blocks work together to produce a thorough review\n", + "\n", + "Run the cell below to see the complete prompt and the model's response.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Example: Comprehensive Code Review aligned with the six-block template\n", + "code_diff = '''\n", + "+ import json\n", + "+ import time\n", + "+ from decimal import Decimal\n", + "+\n", + "+ CACHE = {}\n", + "+\n", + "+ def generate_monthly_report(org_id, db, s3_client):\n", + "+ if org_id in CACHE:\n", + "+ return CACHE[org_id]\n", + "+\n", + "+ query = f\"SELECT * FROM invoices WHERE org_id = '{org_id}' ORDER BY created_at DESC\"\n", + "+ rows = db.execute(query)\n", + "+\n", + "+ total = Decimal(0)\n", + "\n", + "+ items = []\n", + "+ for row in rows:\n", + "+ total += Decimal(row['amount'])\n", + "+ items.append({\n", + "+ 'id': row['id'],\n", + "+ 'customer': row['customer_name'],\n", + "+ 'amount': float(row['amount'])\n", + "+ })\n", + "+\n", + "+ payload = {\n", + "+ 'org': org_id,\n", + "+ 'generated_at': time.strftime('%Y-%m-%d %H:%M:%S'),\n", + "+ 'total': float(total),\n", + "+ 'items': items\n", + "+ }\n", + "+\n", + "+ key = f\"reports/{org_id}/{int(time.time())}.json\"\n", + "+ time.sleep(0.5)\n", + "+ s3_client.put_object(\n", + "+ Bucket='company-reports',\n", + "+ Key=key,\n", + "+ Body=json.dumps(payload),\n", + "+ ACL='public-read'\n", + "+ )\n", + "+\n", + "+ CACHE[org_id] = key\n", + "+ return key\n", + "'''\n", + "\n", + "messages = [\n", + " {\n", + " 'role': 'system',\n", + " 'content': 'You follow structured review templates and produce clear, actionable findings.'\n", + " },\n", + " {\n", + " 'role': 'user',\n", + " 'content': f'''\n", + "\n", + "\n", + "Act as a Senior Software Engineer specializing in Python backend services.\n", + "Your expertise covers security best practices, performance tuning, reliability, and maintainable design.\n", + "\n", + "\n", + "\n", + "\n", + "Repository: analytics-platform\n", + "Service: Reporting API\n", + "Purpose: Add a monthly invoice report exporter that finance can trigger\n", + "Change Scope: Review focuses on the generate_monthly_report implementation\n", + "Language: python\n", + "\n", + "\n", + "\n", + "\n", + "{code_diff}\n", + "\n", + "\n", + "\n", + "\n", + "Assess the change across multiple dimensions:\n", + "1. 
Security β€” SQL injection, S3 object exposure, sensitive data handling.\n", + "2. Performance β€” query efficiency, blocking calls, caching behaviour.\n", + "3. Error Handling β€” resilience to empty results, network/storage failures.\n", + "4. Code Quality β€” readability, global state, data conversions.\n", + "5. Correctness β€” totals, currency precision, repeated report generation.\n", + "6. Best Practices β€” configuration management, separation of concerns, testing hooks.\n", + "For each finding, cite the diff line, describe impact, and share an actionable fix.\n", + "\n", + "\n", + "\n", + "\n", + "Step 1 - Think: Analyse the diff using the dimensions listed above.\n", + "Step 2 - Assess: For each issue, capture Severity (CRITICAL/MAJOR/MINOR/INFO), Category, Line, Issue, Impact.\n", + "Step 3 - Suggest: Provide a concrete remediation (code change or process tweak).\n", + "Step 4 - Verdict: Summarise overall risk and recommend APPROVE / REQUEST CHANGES / NEEDS WORK.\n", + "\n", + "\n", + "\n", + "\n", + "## Code Review Summary\n", + "[One paragraph on overall health and primary risks]\n", + "\n", + "## Findings\n", + "### [SEVERITY] Issue Title\n", + "**Category:** [Security / Performance / Quality / Correctness / Best Practices]\n", + "**Line:** [line number]\n", + "**Issue:** [impact-focused description]\n", + "**Recommendation:**\n", + "```\n", + "# safer / faster / cleaner fix here\n", + "```\n", + "\n", + "## Overall Assessment\n", + "**Recommendation:** [APPROVE | REQUEST CHANGES | NEEDS WORK]\n", + "**Summary:** [What to address before merge]\n", + "\n", + "'''\n", + " }\n", + "]\n", + "\n", + "print('πŸ” COMPREHENSIVE CODE REVIEW IN PROGRESS...')\n", + "print('=' * 70)\n", + "review_result = get_chat_completion(messages, temperature=0.0)\n", + "print(review_result)\n", + "print('=' * 70)\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## πŸ‹οΈ Hands-On Practice: Activity 3.1\n", + "\n", + "
\n", + "πŸ“ Activity Time: Work in Your Own File!

\n", + "\n", + "Instead of working in this notebook, you'll complete this activity in a dedicated markdown file. This gives you:\n", + "
    \n", + "
  • βœ… A clean workspace for your solutions
  • \n", + "
  • βœ… Easy file sharing with instructors/peers
  • \n", + "
  • βœ… Better version control (commit your progress!)
  • \n", + "
  • βœ… Reusable templates for your projects
  • \n", + "
\n", + "
\n", + "\n", + "### 🎯 What You'll Build\n", + "\n", + "A production-ready code review template by researching AWS patterns and applying them to comprehensive code review.\n", + "\n", + "**Time Required:** 30-40 minutes\n", + "\n", + "### πŸ“ Instructions\n", + "\n", + "1. **Open the activity file:** [`activities/activity-3.1-code-review.md`](./activities/activity-3.1-code-review.md)\n", + "2. **Follow the 3-step process:**\n", + " - **Step 1 (10-15 min):** Research AWS code review patterns\n", + " - **Step 2 (10-15 min):** Design your template (answer planning questions)\n", + " - **Step 3 (15-20 min):** Build your template between the markers\n", + "3. **Test your template** using the helper function below\n", + "4. **Compare with solution** when done: [`solutions/activity-3.1-code-review-solution.md`](./solutions/activity-3.1-code-review-solution.md)\n", + "\n", + "### πŸ§ͺ Testing Your Activity\n", + "\n", + "Use the helper function below to test your template directly from the activity file!" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "
\n", + "\n", + "⚠️ IMPORTANT: Complete your template in the activity file BEFORE running this!\n", + "

\n", + "Steps to complete first:\n", + "
    \n", + "
  • Open activities/activity-3.1-code-review.md
  • \n", + "
  • Replace all <!-- TODO: ... --> comments with your actual content
  • \n", + "
  • Fill in role, guidelines, tasks, and output format sections
  • \n", + "
  • Save the file, then come back and run the cell below
  • \n", + "
\n", + "
\n", + "\n", + "
\n", + "πŸ’‘ Model Quirk:

\n", + "Sometimes the AI might start by quoting a line from your code (like user = rows[0]) before giving the actual review. This is normal model behavior and doesn't affect the quality of your results.\n", + "

\n", + "If this happens:\n", + "
    \n", + "
  • Just ignore the quoted line - the rest of the review will be complete and properly formatted
  • \n", + "
  • Or re-run the cell - it might not happen the second time
  • \n", + "
\n", + "
" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Test your Activity 3.1 template\n", + "\n", + "# This is the vulnerable authentication code from the activity\n", + "test_code = \"\"\"\n", + "+ import hashlib\n", + "+ import time\n", + "+\n", + "+ SESSION_CACHE = {}\n", + "+\n", + "+ def authenticate_user(db, username, password):\n", + "+ username = username or \"\"\n", + "+ password = password or \"\"\n", + "+\n", + "+ query = f\"SELECT id, password_hash, failed_attempts FROM users WHERE username = '{username}'\"\n", + "+ rows = db.execute(query)\n", + "+ user = rows[0]\n", + "+\n", + "+ hashed = hashlib.md5(password.encode()).hexdigest()\n", + "+\n", + "+ if hashed != user[\"password_hash\"]:\n", + "+ db.execute(f\"UPDATE users SET failed_attempts = {user['failed_attempts'] + 1} WHERE id = {user['id']}\")\n", + "+ return {\"status\": \"error\"}\n", + "+\n", + "+ if username not in SESSION_CACHE:\n", + "+ SESSION_CACHE[username] = f\"{user['id']}-{int(time.time())}\"\n", + "+\n", + "+ permissions = []\n", + "+ for role in db.fetch_roles():\n", + "+ if db.has_role(user[\"id\"], role[\"id\"]):\n", + "+ permissions.append(role[\"name\"])\n", + "+\n", + "+ time.sleep(0.5)\n", + "+ db.write_audit_entry(user[\"id\"], username)\n", + "+\n", + "+ return {\"status\": \"ok\", \"session\": SESSION_CACHE[username], \"permissions\": permissions}\n", + "\"\"\"\n", + "\n", + "# Run this to test your template from the activity file\n", + "test_activity_3_1(\n", + " test_code=test_code,\n", + " variables={\n", + " 'tech_stack': 'Python',\n", + " 'repo_name': 'user-auth-service',\n", + " 'service_name': 'Authentication API',\n", + " 'change_purpose': 'Add user login endpoint'\n", + " }\n", + ")\n", + "\n", + "# The function will:\n", + "# 1. Read your template from activities/activity-3.1-code-review.md\n", + "# 2. Substitute the variables\n", + "# 3. Send to the AI model\n", + "# 4. Display the results\n", + "# 5. Ask if you want to save results back to the activity file\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### πŸ“š Learn More: Advanced Code Review Patterns\n", + "\n", + "Want to dive deeper into production code review automation?\n", + "\n", + "**πŸ“– AWS Anthropic Advanced Patterns:**\n", + "- [Code Review Command Pattern](https://github.com/aws-samples/anthropic-on-aws/blob/main/advanced-claude-code-patterns/commands/code-review.md) β€” Full prompt + workflow for automated reviews\n", + "\n", + "**πŸ”— Related Best Practices:**\n", + "- [Claude 4 Prompt Engineering](https://docs.claude.com/en/docs/build-with-claude/prompt-engineering/claude-4-best-practices) β€” Guidance on structuring complex instructions\n", + "- [Prompt Templates & Variables](https://docs.claude.com/en/docs/build-with-claude/prompt-engineering/prompt-templates-and-variables) β€” When and how to parameterize prompts\n", + "- [OpenAI GPT-5 Prompting Guide](https://cookbook.openai.com/examples/gpt-5/gpt-5_prompting_guide) β€” Latest guidance on tactic stacking and failure analysis for GPT-5 models\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## βœ… Section 2 Complete!\n", + "\n", + "
\n", + "πŸŽ‰ Nice work! You just wrapped up the Code Review Automation section.\n", + "
\n", + "\n", + "**Key takeaways**\n", + "- Combined Module 2 tactics into a reusable review template\n", + "- Practiced comprehensive code reviews after researching template patterns\n", + "- Built confidence using the activity workflow and helper tests\n", + "\n", + "**Next up**\n", + "1. Open [`3.3-test-generation-automation.ipynb`](./3.3-test-generation-automation.ipynb)\n", + "2. Use the same setup to explore LLM-powered test generation\n", + "3. Complete Activity 3.2 in its markdown workspace\n", + "\n", + "\n", + "
\n", + " β˜• Need a pause?\n", + " Give your brain a resetβ€”bookmark the next section, stretch for a minute, and come back with fresh eyes.\n", + "
\n" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": ".venv", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.13.2" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/01-course/module-03-applications/README.md b/01-course/module-03-applications/README.md new file mode 100644 index 0000000..9823f19 --- /dev/null +++ b/01-course/module-03-applications/README.md @@ -0,0 +1,56 @@ +# Module 3: Applications + +## Apply Prompt Engineering to SDLC Workflows + +Put advanced prompting tactics into practice by automating code review, test planning, and quality evaluation tasks across the software development lifecycle. + +### Learning Objectives +By completing this module, you will be able to: + +- βœ… Implement prompts that catch defects and gaps during code review, testing, and release gates +- βœ… Design reusable templates parameterized for different services, stacks, and teams +- βœ… Evaluate prompt output with judge rubrics and close the loop with iterative improvements +- βœ… Integrate prompt workflows with CI/CD, quality assurance, and engineering rituals + +### Getting Started + +**First time here?** +- If you haven't set up your development environment yet, follow the [Quick Setup guide](../../README.md#-quick-setup) in the main README first +- **New to Jupyter notebooks?** Read [About Jupyter Notebooks](../../README.md#-about-jupyter-notebooks) to understand how notebooks work and where code executes + +**Ready to start?** +1. **Open Section 3.1**: Start with [3.1-setup-and-introduction.ipynb](./3.1-setup-and-introduction.ipynb) to configure your environment and preview the module +2. **Install dependencies**: Run the "Install Required Dependencies" cell in the notebook or `pip install -r requirements.txt` +3. **Follow the notebook**: Work through each cell sequentiallyβ€”the notebook walks you through setup and exercises +4. **Complete exercises**: Build and test prompts alongside the guided labs and activity files + +> **Note:** Unlike Modules 1 and 2, Module 3 is organized as four linked sections. 
Work through them in order: + +This module’s sections build on one another: +- **[Section 3.1](./3.1-setup-and-introduction.ipynb)** – Environment setup, provider validation, and module orientation +- **[Section 3.2](./3.2-code-review-automation.ipynb)** – Code review automation patterns with parameterized templates +- **[Section 3.3](./3.3-test-generation-automation.ipynb)** – Test generation automation for translating requirements into suites +- **[Section 3.4](./3.4-llm-as-judge-evaluation.ipynb)** – LLM-as-judge evaluation, rubric design, and automated quality gates + +### Module Contents +- **[3.1-setup-and-introduction.ipynb](./3.1-setup-and-introduction.ipynb)** – Environment checks and provider validation +- **[3.2-code-review-automation.ipynb](./3.2-code-review-automation.ipynb)** – Comprehensive code review workflows +- **[3.3-test-generation-automation.ipynb](./3.3-test-generation-automation.ipynb)** – Requirement-to-test prompt patterns +- **[3.4-llm-as-judge-evaluation.ipynb](./3.4-llm-as-judge-evaluation.ipynb)** – Rubrics and automated quality gates +- **[activities/](./activities/)** – Practice briefs and instructions (`activities/README.md`) +- **[solutions/](./solutions/)** – Reference templates with deep-dive explanations +- **setup_utils.py** – Shared helpers for configuring AI providers and testing templates + +### Time Required +Approximately 120-150 minutes (2-2.5 hours) + +### Prerequisites +- Python 3.8+ installed +- IDE with notebook support (VS Code or Cursor recommended) +- API access to GitHub Copilot, CircuIT, or OpenAI + +### Next Steps +After completing this module: +1. Refine and version your prompt templates using the testing helpers in `setup_utils.py` +2. Compare your work with the solutions directory to identify improvement ideas +3. Continue to [Module 4: Integration](../module-04-integration/) to operationalize prompt engineering across your organization diff --git a/01-course/module-03-applications/activities/activity-3.1-code-review.md b/01-course/module-03-applications/activities/activity-3.1-code-review.md new file mode 100644 index 0000000..a05016a --- /dev/null +++ b/01-course/module-03-applications/activities/activity-3.1-code-review.md @@ -0,0 +1,298 @@ +# Activity 3.1: Build Your Own Code Review Template + +**⏱️ Time Required:** 30-40 minutes +**🎯 Difficulty:** Intermediate +**πŸ“š Prerequisites:** Complete `Section 1` of `3.2-code-review-automation.ipynb` + +--- + +## 🎯 Your Mission + +Create a reusable code review prompt template. You will research an industry pattern, adapt it to your needs, and make sure it catches issues across security, performance, maintainability, and overall code quality. + +--- + +## πŸ“‹ Success Criteria + +Your template should: +- βœ… Review code across security, performance, maintainability, and best practices +- βœ… Call out high-impact issues with line references and severity +- βœ… Suggest clear, actionable fixes +- βœ… Produce a tidy, repeatable review format + +--- + +## πŸ” Scenario Snapshot + +The team maintains a user authentication service. Recent reviews surfaced: +- SQL injection and weak password hashing +- Inefficient database access patterns +- Thin validation and error handling + +Your template must consistently catch issues like these. + +--- + +## πŸ“ Working Plan + +### Step 1 β€” Research (10-15 minutes) +1. Read the [AWS Anthropic Code Review Pattern](https://github.com/aws-samples/anthropic-on-aws/blob/main/advanced-claude-code-patterns/commands/code-review.md). +2. 
Jot down how they structure the prompt, the review dimensions they insist on, how severity is defined, and what their output looks like. + +Use the space below for quick notes: +``` +Dimensions to copy or adapt: +- Security: +- Performance: +- Maintainability / Quality: +- Other: + +What keeps feedback actionable? +- +``` + +### Step 2 β€” Blueprint Your Template (10-15 minutes) +Fill in these prompts before you touch the template block (we show XML by default, but you can swap in Markdown later if you prefer): +``` +Role (who is reviewing?): + +Essential context to provide: + +Must-have review dimensions: + +Output shape (sections, severity labels, etc.): + +Any reusable {{variables}} you want: +``` + +### Step 3 β€” Build & Test (15-20 minutes) +1. Scroll to the template block below and edit only the content between `` and ``. +2. Replace placeholder text with your own role, guidelines, tasks, and output format. +3. Stick with the XML shell shown, or switch the code fence (e.g., to ````markdown) and rewrite it in structured Markdownβ€”the tester will capture everything between the markers either way. +4. Save the file, then open `3.2-code-review-automation.ipynb` and run `test_activity_3_1()` to check your work. + +**Helpful reminders** +- Leave the HTML comments (``) in place so the tester can find your template. +- If you keep the XML version, reuse the existing tags (``, ``, etc.) and keep `{{variables}}` for portability. +- If you switch to Markdown, keep the sections clearly labeled (headings or bold labels work well) and leave `{{variables}}` wherever you need substitution. +- Make severity labels and categories meaningful for your team. Think β€œCRITICAL/MAJOR/MINOR/INFO” or similar. + +
+
+**❓ Why do I need those HTML comment markers?**
+
+The HTML comment markers directly above and below the editable template block tell the `test_activity_3_1()` function where your template begins and ends. They're invisible when markdown is rendered but essential for the auto-testing feature!
+
+ +--- + +## πŸ‘‡ YOUR EDITABLE TEMPLATE IS BELOW πŸ‘‡ + +````xml +/******************************************************************************* + * ✏️ EDIT YOUR TEMPLATE BETWEEN THE COMMENT BLOCKS + * + * The test function extracts everything between: + * and + * + * Instructions: + * 1. Replace TODO comments with your content + * 2. Customize guidelines, tasks, and output format + * 3. Keep the structure clear (XML tags or well-labeled Markdown) + * 4. Use {{variables}} for parameterization + * + * Tip: If you convert this to Markdown, change the opening ````xml fence accordingly. + ******************************************************************************/ + + + + + + + + +Repository: {{repo_name}} +Service: {{service_name}} +Purpose: {{change_purpose}} + + + +{{code_diff}} + + + + + + + + + + + + + + + + + + + + + +/******************************************************************************* + * YOUR TEMPLATE ENDS HERE + * + * Next step: Test it! + * Go to: 3.2-code-review-automation.ipynb + * Run: test_activity_3_1(test_code="...", variables={...}) + ******************************************************************************/ +```` + +--- + +### Step 4: Test Your Template + +--- + +**πŸ§ͺ Test in Notebook:** + +Open `3.2-code-review-automation.ipynb` and run: + +```python +# Test with the authentication code vulnerability +test_code = """ ++ import hashlib ++ import time ++ ++ SESSION_CACHE = {} ++ ++ def authenticate_user(db, username, password): ++ username = username or "" ++ password = password or "" ++ ++ query = f"SELECT id, password_hash, failed_attempts FROM users WHERE username = '{username}'" ++ rows = db.execute(query) ++ user = rows[0] ++ ++ hashed = hashlib.md5(password.encode()).hexdigest() ++ ++ if hashed != user["password_hash"]: ++ db.execute(f"UPDATE users SET failed_attempts = {user['failed_attempts'] + 1} WHERE id = {user['id']}") ++ return {"status": "error"} ++ ++ if username not in SESSION_CACHE: ++ SESSION_CACHE[username] = f"{user['id']}-{int(time.time())}" ++ ++ permissions = [] ++ for role in db.fetch_roles(): ++ if db.has_role(user["id"], role["id"]): ++ permissions.append(role["name"]) ++ ++ time.sleep(0.5) ++ db.write_audit_entry(user["id"], username) ++ ++ return {"status": "ok", "session": SESSION_CACHE[username], "permissions": permissions} +""" + + test_activity_3_1( + test_code=test_code, + variables={ + 'tech_stack': 'Python', + 'repo_name': 'user-auth-service', + 'service_name': 'Authentication API', + 'change_purpose': 'Add user login endpoint' + } + ) + ``` + + **Your template's output:** + + ``` + [Results will be automatically saved here when you test] + ``` + + +**Self-Check:** +- [ ] Did it identify security issues (SQL injection, weak hashing, predictable tokens)? +- [ ] Did it flag performance or scalability problems (blocking sleeps, N+1 queries, unnecessary work)? +- [ ] Did it call out maintainability concerns (global state, missing resource cleanup, lack of separation of concerns)? +- [ ] Did it note robustness or testing gaps (silent failures, missing validation, error handling)? +- [ ] Did it provide actionable fixes with severity levels and clear categories? +- [ ] Is the review output well-structured and easy for engineers to follow? 
+ +## βœ… Self-Assessment Checklist + +Before considering this activity complete, verify: + +- [ ] My template reviews multiple dimensions (security, performance, quality) +- [ ] My template identifies common issues in the test code +- [ ] Each finding includes severity rating and category +- [ ] Each finding suggests a concrete fix +- [ ] Output is well-structured and readable +- [ ] Template uses parameterization ({{variables}}) +- [ ] I tested with the provided code sample +- [ ] I drew inspiration from AWS patterns without over-complicating + +--- + +## πŸš€ Next Steps + +### Compare with Solution +Once you're satisfied with your template, compare it with the official solution: +πŸ“– [`solutions/activity-3.1-code-review-solution.md`](../solutions/activity-3.1-code-review-solution.md) + +### Keep Iterating +- Save a copy of your finished template in your repo (for example, `prompts/code-review-template.xml`) so you can reuse and improve it. +- Drop it into your pull-request workflow or CI pipeline and tweak it as you gather feedback from teammates. +- Start a changelog for prompt revisionsβ€”treat it like any other piece of your development toolkit. + +### Reflect +What did you learn? +``` +1. + +2. + +3. +``` + +### Continue Learning +Return to `3.3-test-generation-automation.ipynb` to continue with **Section 2: Test Generation** + +--- + +## πŸ’‘ Need Help? + +**Common Issues:** + +| Problem | Solution | +|---------|----------| +| Template too generic | Add specific checks for each dimension (security, performance, quality) | +| Missing issues | Review AWS patterns - what dimensions do they cover? Be systematic. | +| Output not structured | Use explicit section markers in your `` block | +| No severity levels | Define clear criteria: CRITICAL (security/data loss), MAJOR (bugs), MINOR (quality) | + +**Still stuck?** Check the solution file for guidance, but try to solve it yourself first! + +--- + +**πŸ“… Completed on:** ___________ +**⏱️ Time taken:** ___________ minutes + +--- + +## πŸŽ“ Learning Notes + +Use this space to capture key insights: + +``` +What worked well in my template: + + +What I would improve: + + +How I'll apply this to my projects: + + +``` diff --git a/01-course/module-03-applications/requirements.txt b/01-course/module-03-applications/requirements.txt new file mode 100644 index 0000000..fbd36d0 --- /dev/null +++ b/01-course/module-03-applications/requirements.txt @@ -0,0 +1,11 @@ +openai +langchain +python-dotenv +requests +notebook +ipython +chromadb +google-api-python-client +google-auth-httplib2 +google-auth-oauthlib +anthropic \ No newline at end of file diff --git a/01-course/module-03-applications/setup_utils.py b/01-course/module-03-applications/setup_utils.py new file mode 100644 index 0000000..175b901 --- /dev/null +++ b/01-course/module-03-applications/setup_utils.py @@ -0,0 +1,504 @@ +""" +Module 3: Shared Setup Utilities +This file contains all setup code to avoid repetition across notebooks. +Run setup once, then import these functions in any notebook. 
+""" + +import openai +import anthropic +import os +import re +from pathlib import Path + +# ============================================ +# 🎯 CONFIGURATION +# ============================================ +# Set your preference: "openai", "claude", or "circuit" +PROVIDER = "claude" + +# Available models by provider +OPENAI_DEFAULT_MODEL = "gpt-5" # Works with OpenAI API, GitHub Copilot +CIRCUIT_DEFAULT_MODEL = "gpt-4o" +CLAUDE_DEFAULT_MODEL = "claude-sonnet-4" + +# ============================================ +# πŸ€– AI CLIENT INITIALIZATION +# ============================================ + +# OPTION A: GitHub Copilot Proxy (Default - Recommended for Course) +# Use local proxy that routes through GitHub Copilot +# Supports both OpenAI and Claude models via single proxy +openai_client = openai.OpenAI( + base_url="http://localhost:7711/v1", + api_key="dummy-key" +) + +claude_client = anthropic.Anthropic( + api_key="dummy-key", + base_url="http://localhost:7711" +) + +# Placeholder for CircuIT client (will be set if Option C is uncommented) +circuit_client = None +circuit_app_key = None + +# OPTION B: Direct OpenAI API +# IMPORTANT: Comment out Option A (lines 27-38) before using this option +# Setup: Add your API key to .env file, then uncomment and run +# from dotenv import load_dotenv +# +# load_dotenv() +# +# openai_client = openai.OpenAI( +# api_key=os.getenv("OPENAI_API_KEY") # Set this in your .env file +# ) +# + + +# OPTION C: CircuIT APIs (Azure OpenAI) +# IMPORTANT: Comment out Option A (lines 27-38) before using this option +# Supported models: gpt-4, gpt-4o (not gpt-5) +# Setup: Configure environment variables in .env file: +# - CISCO_CLIENT_ID +# - CISCO_CLIENT_SECRET +# - CISCO_OPENAI_APP_KEY +# Get values from: https://ai-chat.cisco.com/bridgeit-platform/api/home +# Then uncomment and run (also change PROVIDER to "circuit" at the top): +# import traceback +# import requests +# import base64 +# from dotenv import load_dotenv +# from openai import AzureOpenAI +# +# # Load environment variables +# load_dotenv() +# +# # OpenAI version to use +# openai.api_type = "azure" +# openai.api_version = "2024-12-01-preview" +# +# # Get API_KEY wrapped in token - using environment variables +# client_id = os.getenv("CISCO_CLIENT_ID") +# client_secret = os.getenv("CISCO_CLIENT_SECRET") +# +# url = "https://id.cisco.com/oauth2/default/v1/token" +# +# payload = "grant_type=client_credentials" +# value = base64.b64encode(f"{client_id}:{client_secret}".encode("utf-8")).decode("utf-8") +# headers = { +# "Accept": "*/*", +# "Content-Type": "application/x-www-form-urlencoded", +# "Authorization": f"Basic {value}", +# } +# +# token_response = requests.request("POST", url, headers=headers, data=payload) +# print(token_response.text) +# token_data = token_response.json() +# +# circuit_client = AzureOpenAI( +# azure_endpoint="https://chat-ai.cisco.com", +# api_key=token_data.get("access_token"), +# api_version="2024-12-01-preview", +# ) +# +# circuit_app_key = os.getenv("CISCO_OPENAI_APP_KEY") +# +# print("βœ… CircuIT API configured successfully!") + + +# ============================================ +# πŸ”§ HELPER FUNCTIONS +# ============================================ + +def _extract_text_from_blocks(blocks): + """Extract text content from response blocks returned by the API.""" + parts = [] + for block in blocks: + text_val = getattr(block, "text", None) + if isinstance(text_val, str): + parts.append(text_val) + elif isinstance(block, dict): + t = block.get("text") + if isinstance(t, str): + 
parts.append(t) + return "\n".join(parts) + + +def get_openai_completion(messages, model=None, temperature=0.0): + """Get completion from OpenAI models via GitHub Copilot.""" + if model is None: + model = OPENAI_DEFAULT_MODEL + try: + response = openai_client.chat.completions.create( + model=model, + messages=messages, + temperature=temperature + ) + return response.choices[0].message.content + except Exception as e: + return f"❌ Error: {e}\nπŸ’‘ Make sure GitHub Copilot proxy is running on port 7711" + + +def get_claude_completion(messages, model=None, temperature=0.0): + """Get completion from Claude models via GitHub Copilot.""" + if model is None: + model = CLAUDE_DEFAULT_MODEL + try: + response = claude_client.messages.create( + model=model, + max_tokens=8192, + messages=messages, + temperature=temperature + ) + return _extract_text_from_blocks(getattr(response, "content", [])) + except Exception as e: + return f"❌ Error: {e}\nπŸ’‘ Make sure GitHub Copilot proxy is running on port 7711" + + +def get_circuit_completion(messages, model=None, temperature=0.0): + """Get completion from CircuIT APIs (Azure OpenAI via Cisco).""" + if circuit_client is None or circuit_app_key is None: + return "❌ Error: CircuIT not configured\nπŸ’‘ Uncomment Option C in setup_utils.py and set your CircuIT credentials" + + if model is None: + model = CIRCUIT_DEFAULT_MODEL + try: + response = circuit_client.chat.completions.create( + model=model, + messages=messages, + temperature=temperature, + user=f'{{"appkey": "{circuit_app_key}"}}' # CircuIT requires app_key in user field + ) + return response.choices[0].message.content + except Exception as e: + return f"❌ Error: {e}\nπŸ’‘ Check your CircuIT credentials and connection" + + +def get_chat_completion(messages, model=None, temperature=0.0): + """ + Generic function to get chat completion from any provider. + Routes to the appropriate provider-specific function based on PROVIDER setting. + """ + if PROVIDER.lower() == "claude": + return get_claude_completion(messages, model, temperature) + elif PROVIDER.lower() == "circuit": + return get_circuit_completion(messages, model, temperature) + else: # Default to OpenAI + return get_openai_completion(messages, model, temperature) + + +def get_default_model(): + """Get the default model for the current provider.""" + if PROVIDER.lower() == "claude": + return CLAUDE_DEFAULT_MODEL + elif PROVIDER.lower() == "circuit": + return CIRCUIT_DEFAULT_MODEL + else: # Default to OpenAI + return OPENAI_DEFAULT_MODEL + + +# ============================================ +# πŸ§ͺ ACTIVITY and SOLUTION TESTING FUNCTIONS +# ============================================ + +def extract_template_from_activity(activity_file): + """ + Extract the prompt template from an activity markdown file. + Looks for content between: and + + Args: + activity_file: Path to the activity .md file + + Returns: + tuple: (template_text, error_message) + """ + try: + file_path = Path(activity_file) + if not file_path.exists(): + return None, f"❌ File not found: {activity_file}" + + content = file_path.read_text() + + # Extract template between markers + match = re.search( + r'(.*?)', + content, + re.DOTALL + ) + + if match: + template = match.group(1).strip() + # Remove markdown code block markers if present + # template = re.sub(r'^```\w*\n', '', template) + # template = re.sub(r'\n```$', '', template) + return template, None + else: + return None, "⚠️ Template markers not found. 
Make sure your template is between:\n \n " + + except Exception as e: + return None, f"❌ Error reading file: {e}" + + +def test_activity(activity_file, test_code=None, variables=None, auto_save=True): + """ + Test your activity template directly from the .md file. + + IMPORTANT: Complete your activity template BEFORE running this function! + - Open the activity file (e.g., 'activities/activity-3.1-code-review.md') + - Replace all comments with your actual content + - Fill in role, guidelines, tasks, and output format sections + - Save the file, then run this test function + + Args: + activity_file: Path to your activity file (e.g., 'activities/activity-3.1-code-review.md') + test_code: Optional code sample to review (uses example from file if not provided) + variables: Optional dict of template variables (e.g., {'tech_stack': 'Python', 'repo_name': 'my-app'}) + auto_save: If True, prompts to save result back to activity file + + Returns: + The AI's response + """ + print("="*70) + print("πŸ§ͺ TESTING YOUR ACTIVITY TEMPLATE") + print("="*70) + print("\n⚠️ REMINDER: Make sure you've completed your template first!") + print(" (Replace all comments with actual content)") + + # Extract template + print(f"\nπŸ“– Reading template from: {activity_file}") + template, error = extract_template_from_activity(activity_file) + + if error or template is None: + print(error if error else "❌ Error: Template extraction failed") + return None + + print("βœ… Template loaded successfully!") + print(f"πŸ“ Template length: {len(template)} characters\n") + + # Substitute variables if provided + if variables: + print("πŸ”„ Substituting template variables...") + for key, value in variables.items(): + placeholder = "{{" + key + "}}" + template = template.replace(placeholder, str(value)) + print(f" β€’ {placeholder} β†’ {value}") + print() + + # Add test code if provided + if test_code: + print("πŸ“ Using provided test code\n") + # Replace common placeholders + template = template.replace("{{code_diff}}", test_code) + template = template.replace("{{code}}", test_code) + template = template.replace("{{code_sample}}", test_code) + + # Execute prompt + print("πŸ€– Sending to AI model...") + print("-"*70) + + try: + messages = [{"role": "user", "content": template}] + response = get_chat_completion(messages) + + # Check if response contains error message + if response and ("❌ Error" in response or "Error:" in response): + print("\n" + response) + print("-"*70) + print("\n⚠️ AI request failed. Please check:") + print(" β€’ GitHub Copilot proxy is running (for Option A)") + print(" β€’ API keys are configured correctly (for Options B/C)") + print(" β€’ PROVIDER setting matches your active option") + print(" β€’ Template contains valid content") + return None + + print(response) + print("-"*70) + + # Save result back to activity file + if auto_save and response: + print("\n" + "="*70) + print("πŸ“ SAVE RESULT") + print("="*70) + print("⬆️ LOOK AT THE TOP OF YOUR IDE FOR THE INPUT BOX! ⬆️") + print(" Type 'y' or 'n' in the input field at the top of the screen") + print("="*70) + try: + save_result = input("πŸ’Ύ Save this result to your activity file? (y/n): ") + if save_result.lower() == 'y': + save_test_result(activity_file, test_code, response) + print("βœ… Result saved to activity file!") + else: + print("⏭️ Result not saved. 
You can run this test again anytime.")
+            except Exception as save_error:
+                print(f"⚠️ Could not save result: {save_error}")
+
+        return response
+
+    except Exception as e:
+        print(f"\n❌ Unexpected error during AI request: {e}")
+        print("-"*70)
+        print("\nπŸ’‘ Troubleshooting:")
+        print("   β€’ Verify your API configuration is correct")
+        print("   β€’ Check that template placeholders are properly filled")
+        print("   β€’ Ensure your selected provider is available")
+        return None
+
+
+def save_test_result(activity_file, test_code, response):
+    """Save test results back to the activity file."""
+    file_path = Path(activity_file)
+    content = file_path.read_text()
+    original_content = content
+
+    # Find the test results section and update it
+    from datetime import datetime
+    timestamp = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
+    cleaned_response = response.strip()
+    # Saved results are wrapped in HTML comment markers so repeat runs can replace them
+    result_block = (
+        f"\n<!-- TEST-RESULT-START {timestamp} -->\n"
+        f"{cleaned_response}\n"
+        f"<!-- TEST-RESULT-END -->"
+    )
+    updated = False
+
+    # Replace existing result or insert after "Your template's output:"
+    if '<!-- TEST-RESULT-START' in content:
+        content = re.sub(
+            r'<!-- TEST-RESULT-START.*?<!-- TEST-RESULT-END -->',
+            result_block,
+            content,
+            flags=re.DOTALL
+        )
+        updated = True
+    else:
+        # Find where to insert (after various possible markers)
+        patterns = [
+            r"(\*\*Your template's output:\*\*\s*```[^\n]*\n)",
+            r"(\*\*Output:\*\*\s*```[^\n]*\n)",
+            r"(### Test Results\s*\n)"
+        ]
+        for pattern in patterns:
+            if re.search(pattern, content):
+                content = re.sub(pattern, r'\1' + result_block + '\n', content)
+                updated = True
+                break
+
+    if not updated:
+        # Append a new Test Results section at the end if no markers were found
+        append_block = "\n\n### Test Results\n" + result_block + "\n"
+        content = content.rstrip() + append_block
+        updated = True
+
+    if content == original_content:
+        raise RuntimeError("No changes were applied while attempting to save the test result.")
+
+    file_path.write_text(content)
+
+
+def list_activities():
+    """Show available activities to test."""
+    activities_dir = Path('activities')
+
+    if not activities_dir.exists():
+        print("❌ Activities directory not found")
+        print("πŸ’‘ Make sure you're running from the module-03-applications directory")
+        return
+
+    print("="*70)
+    print("πŸ“š AVAILABLE ACTIVITIES")
+    print("="*70)
+
+    activity_files = sorted(activities_dir.glob('activity-*.md'))
+
+    if not activity_files:
+        print("⚠️ No activity files found")
+        return
+
+    for i, file in enumerate(activity_files, 1):
+        # Extract title from file
+        try:
+            content = file.read_text()
+            title_match = re.search(r'^# (.+)$', content, re.MULTILINE)
+            title = title_match.group(1) if title_match else file.name
+
+            print(f"{i}. {file.name}")
+            print(f"   {title}")
+            print()
+        except Exception:
+            print(f"{i}. {file.name}")
+            print()
+
+    print("="*70)
+    print("πŸ’‘ Usage: test_activity('activities/activity-3.1-code-review.md')")
+
+
+# Quick access functions for each activity
+def test_activity_3_1(test_code=None, variables=None):
+    """
+    Quick helper for Activity 3.1: Comprehensive Code Review
+
+    IMPORTANT: Complete your template in the activity file BEFORE running this!
+    """
+    return test_activity('activities/activity-3.1-code-review.md', test_code=test_code, variables=variables)
+
+
+def test_activity_3_1_solution(test_code=None, variables=None):
+    """
+    Test the provided solution for Activity 3.1: Comprehensive Code Review
+
+    Use this to see how the solution template works before building your own.
+    Note: auto_save is disabled for solution files to keep them as clean references.
+ """ + return test_activity('solutions/activity-3.1-code-review-solution.md', test_code=test_code, variables=variables, auto_save=False) + + +def test_activity_3_2(test_code=None, variables=None): + """ + Quick helper for Activity 3.2: Test Generation + + IMPORTANT: Complete your template in the activity file BEFORE running this! + """ + return test_activity('activities/activity-3.2-test-generation.md', test_code=test_code, variables=variables) + + +def test_activity_3_2_solution(test_code=None, variables=None): + """ + Test the provided solution for Activity 3.2: Test Generation + + Use this to see how the solution template works before building your own. + Note: auto_save is disabled for solution files to keep them as clean references. + """ + return test_activity('solutions/activity-3.2-test-generation-solution.md', test_code=test_code, variables=variables, auto_save=False) + + +# ============================================ +# πŸ§ͺ CONNECTION TEST +# ============================================ + +def test_connection(): + """Test connection to AI services.""" + print("πŸ”„ Testing connection to GitHub Copilot proxy...") + test_result = get_chat_completion([ + {"role": "user", "content": "Say 'Connection successful!' if you can read this."} + ]) + + if test_result and ("successful" in test_result.lower() or "success" in test_result.lower()): + print(f"βœ… Connection successful! Using {PROVIDER.upper()} provider with model: {get_default_model()}") + print(f"πŸ“ Response: {test_result}") + return True + else: + print("⚠️ Connection test completed but response unexpected:") + print(f"πŸ“ Response: {test_result}") + return False + + +# ============================================ +# πŸ“Š MODULE INITIALIZATION +# ============================================ + +# Mark module as loaded for checking in notebooks +import sys +sys.modules['__module3_setup__'] = sys.modules[__name__] + +print("βœ… Module 3 setup utilities loaded successfully!") +print(f"πŸ€– Provider: {PROVIDER.upper()}") +print(f"πŸ“ Default model: {get_default_model()}") diff --git a/01-course/module-03-applications/solutions/activity-3.1-code-review-solution.md b/01-course/module-03-applications/solutions/activity-3.1-code-review-solution.md new file mode 100644 index 0000000..0c1f636 --- /dev/null +++ b/01-course/module-03-applications/solutions/activity-3.1-code-review-solution.md @@ -0,0 +1,349 @@ +# Activity 3.1 Solution: Comprehensive Code Review Template + +**⏱️ Completion Time:** Reference solution +**🎯 Focus:** Multi-dimensional code review (security, performance, quality, best practices) +**πŸ“š Ready to Test:** Use `test_activity_3_1_solution()` with this file path + +--- + +## 🎯 Complete Working Solution + +This solution demonstrates a **production-ready comprehensive code review template** that evaluates code across multiple dimensions: security, performance, maintainability, and best practices. 
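+
+If you just want to inspect the raw template text before any variables are substituted, here is a minimal sketch using `extract_template_from_activity()` from `setup_utils.py` (it assumes you run Python from the `module-03-applications` directory so the relative path resolves):
+
+```python
+# Preview the solution template without calling an AI model.
+from setup_utils import extract_template_from_activity
+
+template, error = extract_template_from_activity(
+    "solutions/activity-3.1-code-review-solution.md"
+)
+if error:
+    print(error)  # e.g., file not found or template markers missing
+else:
+    print(f"Template length: {len(template)} characters")
+    print(template[:400])  # peek at the opening of the template
+```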
+ +### How to Test This Solution + +```python +# In 3.2-code-review-automation.ipynb +from setup_utils import test_activity + +# Test the solution template +test_activity( + 'solutions/activity-3.1-code-review-solution.md', + test_code = """ ++ import hashlib ++ import time ++ ++ SESSION_CACHE = {} ++ ++ def authenticate_user(db, username, password): ++ username = username or "" ++ password = password or "" ++ ++ query = f"SELECT id, password_hash, failed_attempts FROM users WHERE username = '{username}'" ++ rows = db.execute(query) ++ user = rows[0] ++ ++ hashed = hashlib.md5(password.encode()).hexdigest() ++ ++ if hashed != user["password_hash"]: ++ db.execute(f"UPDATE users SET failed_attempts = {user['failed_attempts'] + 1} WHERE id = {user['id']}") ++ return {"status": "error"} ++ ++ if username not in SESSION_CACHE: ++ SESSION_CACHE[username] = f"{user['id']}-{int(time.time())}" ++ ++ permissions = [] ++ for role in db.fetch_roles(): ++ if db.has_role(user["id"], role["id"]): ++ permissions.append(role["name"]) ++ ++ time.sleep(0.5) ++ db.write_audit_entry(user["id"], username) ++ ++ return {"status": "ok", "session": SESSION_CACHE[username], "permissions": permissions} +""", + variables={ + 'tech_stack': 'Python', + 'repo_name': 'user-auth-service', + 'service_name': 'Authentication API', + 'change_purpose': 'Add user login endpoint' + } +) +``` + +--- + +## πŸ‘‡ COMPLETE WORKING TEMPLATE BELOW πŸ‘‡ + +```xml +/******************************************************************************* + * SOLUTION TEMPLATE FOR ACTIVITY 3.1 + * + * This is a complete, production-ready comprehensive code review template. + * Focus areas: + * - Security (common vulnerabilities) + * - Performance (efficiency and optimization) + * - Code Quality (readability and maintainability) + * - Best Practices (design patterns and idioms) + * - Inspired by AWS patterns without over-complicating + ******************************************************************************/ + + + +You are a Senior Software Engineer specializing in {{tech_stack}} with expertise in: +- Code quality and software design +- Security best practices +- Performance optimization +- Clean code principles and maintainability + + + +Repository: {{repo_name}} +Service: {{service_name}} +Change Purpose: {{change_purpose}} +Review Focus: Comprehensive evaluation across security, performance, quality, and best practices + + + +{{code_diff}} + + + +Conduct a systematic code review across these dimensions: + +1. **Security** + - Input validation and sanitization + - Common vulnerabilities (SQL injection, XSS, authentication issues) + - Sensitive data handling + - Error messages that leak information + +2. **Performance** + - Algorithm efficiency and complexity + - Database query optimization + - Unnecessary operations or redundant code + - Resource usage (memory, I/O) + +3. **Error Handling** + - Proper exception handling + - Edge case coverage + - Graceful degradation + - User-friendly error messages + +4. **Code Quality** + - Readability and clarity + - Code organization and structure + - DRY principle (Don't Repeat Yourself) + - Naming conventions + +5. **Correctness** + - Logic accuracy + - Edge cases and boundary conditions + - Expected vs actual behavior + +6. 
**Best Practices** + - Language idioms and conventions + - Design patterns + - SOLID principles + - Testability + +For each finding: +- Reference specific line numbers +- Explain the issue and its impact +- Provide concrete fixes with code examples +- Keep explanations practical and actionable + + + +Step 1 - Systematic Analysis: Review the code across all dimensions. + Ask yourself: + β€’ Are there security vulnerabilities? + β€’ Could performance be improved? + β€’ Is error handling comprehensive? + β€’ Is the code maintainable and readable? + β€’ Does it follow best practices? + +Step 2 - Categorize Findings: For each issue: + β€’ Severity: CRITICAL | MAJOR | MINOR | INFO + β€’ Category: Security / Performance / Quality / Correctness / Best Practices + β€’ Line: Specific line number + β€’ Impact: Why it matters + β€’ Solution: Concrete fix with example + +Step 3 - Provide Verdict: Overall assessment with clear recommendation + + + +Provide your comprehensive review in this format: + +## Code Review Summary +[Brief overview of the change and general assessment] + +## Findings + +### [SEVERITY] Issue Title +**Category:** [Security / Performance / Quality / Correctness / Best Practices] +**Lines:** [specific line numbers] +**Issue:** [Clear description of the problem] +**Impact:** [Why this matters - user impact, technical debt, security risk, etc.] +**Recommendation:** +```code +[Concrete fix with example code] +``` + +[Repeat for each finding, ordered by severity] + +## Positive Observations +[What was done well - reinforce good practices] + +## Overall Assessment +**Recommendation:** [APPROVE | REQUEST CHANGES | NEEDS WORK] +**Summary:** [Key takeaways and required actions before merge] + + + +/******************************************************************************* + * SOLUTION TEMPLATE ENDS HERE + * + * Expected findings for the test code: + * - Security: SQL injection via string-formatted query + * - Security: Weak password hashing (MD5) + * - Security: Predictable session tokens generated from timestamps + * - Performance: Blocking sleep call and N+1 role lookups + * - Maintainability/Correctness: Missing empty-result handling and reliance on global session cache + ******************************************************************************/ +``` + +--- + +## βœ… What Makes This Solution Comprehensive + +### 1. Multi-Dimensional Role + +```xml + +You are a Senior Software Engineer specializing in {{tech_stack}} with expertise in: +- Code quality and software design +- Security best practices +- Performance optimization + +``` + +**Why this works:** +- Establishes broad expertise across multiple areas +- Not overly specialized (vs just "Security Engineer") +- Appropriate seniority for making architectural decisions + +### 2. Balanced Review Guidelines + +The template covers 6 key dimensions without going too deep into any single one: +- βœ… Security (common issues, not exhaustive penetration testing) +- βœ… Performance (practical optimization, not micro-optimization) +- βœ… Error Handling (comprehensive coverage) +- βœ… Code Quality (readability and maintainability) +- βœ… Correctness (logic verification) +- βœ… Best Practices (language-specific idioms) + +**Why this works:** +- Comprehensive without being overwhelming +- Practical focus on common issues +- Actionable for most development teams + +### 3. 
Clear Severity Levels + +``` +CRITICAL - Security vulnerabilities, data loss risks +MAJOR - Bugs, significant performance issues +MINOR - Code quality, maintainability concerns +INFO - Suggestions, nice-to-haves +``` + +**Why this works:** +- Easy to understand and apply +- Helps prioritize fixes +- Aligns with common development practices + +### 4. Structured Output with Categories + +```xml +### [SEVERITY] Issue Title +**Category:** [Security / Performance / Quality] +**Impact:** [Why it matters] +**Recommendation:** [Concrete fix] +``` + +**Why this works:** +- Makes findings easy to scan and understand +- Categories help route issues to right experts +- Impact explanation justifies the work +- Concrete recommendations enable quick fixes + +### 5. Positive Observations Section + +The template includes a section for positive feedback: +``` +## Positive Observations +[What was done well] +``` + +**Why this works:** +- Reinforces good practices +- Balances constructive criticism +- Encourages developers +- Creates a learning opportunity + +--- + +## πŸ”‘ Key Differences from Security-Only Review + +| Aspect | Security-Only Review | Comprehensive Review | +|--------|---------------------|---------------------| +| **Focus Areas** | Vulnerabilities, attack vectors | Security + Performance + Quality + Best Practices | +| **Role** | Security Engineer, AppSec Specialist | Senior Software Engineer with broad expertise | +| **Depth** | Deep dive into security (OWASP, CWE) | Balanced across multiple dimensions | +| **Standards** | OWASP Top 10, CWE, CVE | Language best practices, design patterns, common sense | +| **Severity** | Based on exploitability | Based on impact across all dimensions | +| **Use Case** | Security-critical changes | General code review | + +--- + +## πŸ“‹ Expected Findings for Test Code + +When you run this template on the test authentication code, you should see: + +### CRITICAL: SQL Injection in User Lookup +**Category:** Security +**Lines:** 8 +**Issue:** Directly interpolating `username` into the SQL query enables injection attacks +**Impact:** Attackers can bypass authentication or read/modify arbitrary user data +**Fix:** Use parameterized queries or ORM helpers with bound parameters + +### CRITICAL: Weak Password Hashing (MD5) +**Category:** Security +**Lines:** 12 +**Issue:** MD5 is a broken hashing algorithm that is trivial to crack +**Impact:** Leaked hashes expose all user passwords within minutes +**Fix:** Replace with a password hashing library (e.g., `bcrypt`, `argon2`) and add salting + +### MAJOR: Predictable Session Tokens +**Category:** Security / Best Practices +**Lines:** 18 +**Issue:** Session tokens are generated from user id + seconds timestamp, making them guessable +**Impact:** Attackers can hijack active sessions by predicting token values +**Fix:** Use cryptographically secure random token generation + +### MAJOR: Performance Bottlenecks in Permission Loading +**Category:** Performance +**Lines:** 21-24, 26 +**Issue:** Loops fetch each role individually (`db.has_role`) and introduce a blocking `time.sleep(0.5)` per login +**Impact:** Causes N+1 database access patterns and unnecessary latency under load +**Fix:** Batch-fetch permissions and remove or justify the artificial sleep + +### MAJOR: Fragile Handling of Query Results and Global State +**Category:** Maintainability / Correctness +**Lines:** 9, 16, 27 +**Issue:** Accessing `rows[0]` without checking for empty results raises exceptions, and a module-level cache couples sessions to process 
memory +**Impact:** Leads to crashes for unknown users, memory leaks, and inconsistent state in multi-instance deployments +**Fix:** Add explicit handling for missing users and replace global cache with scoped session storage + +### Quick Reference: Issues by Dimension + +| Dimension | Signals in Sample Diff | Suggested Focus | +|-----------|------------------------|-----------------| +| Security | SQL injection query, MD5 hashing, timestamp-based session tokens | Parameterized queries, modern hashing algorithms, secure token generation | +| Performance | N+1 role lookups, `time.sleep(0.5)` | Batch permission fetches, remove blocking calls | +| Maintainability / Correctness | `rows[0]` without checks, global `SESSION_CACHE` | Defensive coding for empty data, encapsulated session management | +| Error Handling | No validation for DB results or exceptions | Add guard clauses, return meaningful error responses | + +--- + +**Remember**: This template is a **first-pass review tool** to catch common issues and provide helpful feedback. It complements but doesn't replace human code review, especially for complex architectural decisions. From 7b451228ead44f7040af4be3839c7782e4614cdd Mon Sep 17 00:00:00 2001 From: snehangshuk Date: Fri, 17 Oct 2025 18:14:20 +0530 Subject: [PATCH 4/6] Update activities and solutions for Module 3: Code Review and Test Generation - Renumbered activities to reflect new structure, updating references from Activity 3.1 to 3.2 and from Activity 3.2 to 3.3. - Added new activity files for 3.2 (Code Review Template) and 3.3 (Test Generation Template) with detailed instructions and templates. - Implemented helper functions in `setup_utils.py` for testing activities, ensuring backward compatibility for previous activity references. - Created solution files for both activities, providing comprehensive examples and best practices for code reviews and test generation. - Updated README files to include new activities and solutions, enhancing clarity on objectives and prerequisites. 
--- .../3.1-setup-and-introduction.ipynb | 4 +- .../3.2-code-review-automation.ipynb | 16 +- .../3.3-test-generation-automation.ipynb | 497 ++++++++++++++++++ .../activities/README.md | 241 +++++++++ ...-review.md => activity-3.2-code-review.md} | 12 +- .../activity-3.3-test-generation.md | 300 +++++++++++ .../module-03-applications/setup_utils.py | 46 +- .../solutions/README.md | 102 ++++ ...d => activity-3.2-code-review-solution.md} | 8 +- .../activity-3.3-test-generation-solution.md | 176 +++++++ 10 files changed, 1367 insertions(+), 35 deletions(-) create mode 100644 01-course/module-03-applications/3.3-test-generation-automation.ipynb create mode 100644 01-course/module-03-applications/activities/README.md rename 01-course/module-03-applications/activities/{activity-3.1-code-review.md => activity-3.2-code-review.md} (96%) create mode 100644 01-course/module-03-applications/activities/activity-3.3-test-generation.md create mode 100644 01-course/module-03-applications/solutions/README.md rename 01-course/module-03-applications/solutions/{activity-3.1-code-review-solution.md => activity-3.2-code-review-solution.md} (98%) create mode 100644 01-course/module-03-applications/solutions/activity-3.3-test-generation-solution.md diff --git a/01-course/module-03-applications/3.1-setup-and-introduction.ipynb b/01-course/module-03-applications/3.1-setup-and-introduction.ipynb index a6f8855..35775b3 100644 --- a/01-course/module-03-applications/3.1-setup-and-introduction.ipynb +++ b/01-course/module-03-applications/3.1-setup-and-introduction.ipynb @@ -168,8 +168,8 @@ "Activity Testing Functions:\n", "
    \n", "
  • test_activity(file, code, variables) - Test any activity template
  • \n", - "
  • test_activity_3_1(code, variables) - Quick test for Activity 3.1
  • \n", "
  • test_activity_3_2(code, variables) - Quick test for Activity 3.2
  • \n", + "
  • test_activity_3_3(code, variables) - Quick test for Activity 3.3
  • \n", "
  • list_activities() - Show available activities
  • \n", "
\n", "\n", @@ -275,7 +275,7 @@ "1. Open [`3.2-code-review-automation.ipynb`](./3.2-code-review-automation.ipynb)\n", "2. The setup will already be loaded - just import from `setup_utils`!\n", "3. Learn how to build production-ready code review templates\n", - "4. Complete Activity 3.1 in your own `.md` file\n", + "4. Complete Activity 3.2 in your own `.md` file\n", "\n", "**πŸ’‘ Tip:** Keep this notebook open in case you need to troubleshoot the connection later.\n", "\n", diff --git a/01-course/module-03-applications/3.2-code-review-automation.ipynb b/01-course/module-03-applications/3.2-code-review-automation.ipynb index de8016a..c2e20c5 100644 --- a/01-course/module-03-applications/3.2-code-review-automation.ipynb +++ b/01-course/module-03-applications/3.2-code-review-automation.ipynb @@ -392,7 +392,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "## πŸ‹οΈ Hands-On Practice: Activity 3.1\n", + "## πŸ‹οΈ Hands-On Practice: Activity 3.2\n", "\n", "
\n", "πŸ“ Activity Time: Work in Your Own File!

\n", @@ -414,13 +414,13 @@ "\n", "### πŸ“ Instructions\n", "\n", - "1. **Open the activity file:** [`activities/activity-3.1-code-review.md`](./activities/activity-3.1-code-review.md)\n", + "1. **Open the activity file:** [`activities/activity-3.2-code-review.md`](./activities/activity-3.2-code-review.md)\n", "2. **Follow the 3-step process:**\n", " - **Step 1 (10-15 min):** Research AWS code review patterns\n", " - **Step 2 (10-15 min):** Design your template (answer planning questions)\n", " - **Step 3 (15-20 min):** Build your template between the markers\n", "3. **Test your template** using the helper function below\n", - "4. **Compare with solution** when done: [`solutions/activity-3.1-code-review-solution.md`](./solutions/activity-3.1-code-review-solution.md)\n", + "4. **Compare with solution** when done: [`solutions/activity-3.2-code-review-solution.md`](./solutions/activity-3.2-code-review-solution.md)\n", "\n", "### πŸ§ͺ Testing Your Activity\n", "\n", @@ -445,7 +445,7 @@ "

\n", "Steps to complete first:\n", "
    \n", - "
  • Open activities/activity-3.1-code-review.md
  • \n", + "
  • Open activities/activity-3.2-code-review.md
  • \n", "
  • Replace all <!-- TODO: ... --> comments with your actual content
  • \n", "
  • Fill in role, guidelines, tasks, and output format sections
  • \n", "
  • Save the file, then come back and run the cell below
  • \n", @@ -470,7 +470,7 @@ "metadata": {}, "outputs": [], "source": [ - "# Test your Activity 3.1 template\n", + "# Test your Activity 3.2 template\n", "\n", "# This is the vulnerable authentication code from the activity\n", "test_code = \"\"\"\n", @@ -508,7 +508,7 @@ "\"\"\"\n", "\n", "# Run this to test your template from the activity file\n", - "test_activity_3_1(\n", + "test_activity_3_2(\n", " test_code=test_code,\n", " variables={\n", " 'tech_stack': 'Python',\n", @@ -519,7 +519,7 @@ ")\n", "\n", "# The function will:\n", - "# 1. Read your template from activities/activity-3.1-code-review.md\n", + "# 1. Read your template from activities/activity-3.2-code-review.md\n", "# 2. Substitute the variables\n", "# 3. Send to the AI model\n", "# 4. Display the results\n", @@ -561,7 +561,7 @@ "**Next up**\n", "1. Open [`3.3-test-generation-automation.ipynb`](./3.3-test-generation-automation.ipynb)\n", "2. Use the same setup to explore LLM-powered test generation\n", - "3. Complete Activity 3.2 in its markdown workspace\n", + "3. Complete Activity 3.3 in its markdown workspace\n", "\n", "\n", "
    \n", diff --git a/01-course/module-03-applications/3.3-test-generation-automation.ipynb b/01-course/module-03-applications/3.3-test-generation-automation.ipynb new file mode 100644 index 0000000..8ac6d49 --- /dev/null +++ b/01-course/module-03-applications/3.3-test-generation-automation.ipynb @@ -0,0 +1,497 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Section 3.3: Test Generation Automation\n", + "\n", + "| **Aspect** | **Details** |\n", + "|-------------|-------------|\n", + "| **Goal** | Build production-ready test generation templates that surface coverage gaps and ambiguities |\n", + "| **Time** | ~35 minutes |\n", + "| **Prerequisites** | Sections 1-2 complete, setup_utils.py loaded |\n", + "| **What You'll Learn** | Coverage gap analysis, ambiguity detection, reusable test specifications |\n", + "| **Next Steps** | Continue to Section 3.4: Code Review Automation |\n", + "---\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## πŸ”§ Quick Setup Check\n", + "\n", + "Since you completed Section 1, setup is already done! We just need to import it.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Quick setup check - imports setup_utils\n", + "try:\n", + " import importlib\n", + " import setup_utils\n", + " importlib.reload(setup_utils)\n", + " from setup_utils import *\n", + " print(f\"βœ… Setup loaded! Using {PROVIDER.upper()} with {get_default_model()}\")\n", + " print(\"πŸš€ Ready to build test generation templates!\")\n", + "except ImportError:\n", + " print(\"❌ Setup not found!\")\n", + " print(\"πŸ’‘ Please run 3.1-setup-and-introduction.ipynb first to set up your environment.\")\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## πŸ§ͺ Test Generation Automation Template\n", + "\n", + "### Building a Comprehensive Test Generation Prompt with a Multi-Tactic Stack\n", + "\n", + "
    \n", + "🎯 What You'll Build in This Section

    \n", + "\n", + "You'll create a production-ready test generation prompt template that analyzes vague requirements, surfaces coverage gaps, and produces reusable specs across unit and integration tests.\n", + "\n", + "Time Required: ~35 minutes (learning + examples + activity)\n", + "
    \n", + "\n", + "Test generation makes the model juggle requirements, existing coverage, and missing scenarios at the same time. We'll reuse Module 2 tactics so the flow moves from context β†’ analysis β†’ gap filling without losing track of dependencies.\n", + "\n", + "#### 🎯 The Problem We're Solving\n", + "\n", + "Manual test planning faces three critical challenges:\n", + "\n", + "1. **πŸ“‹ Incomplete Coverage**\n", + " - Easy to miss edge cases and error paths\n", + " - Boundary conditions often overlooked (0%, 100%, empty inputs)\n", + " - Security and performance scenarios fall through the cracks\n", + " - **Impact:** Bugs slip through to production and erode trust\n", + "\n", + "2. **⏰ Time Pressure**\n", + " - Testing gets squeezed at the end of sprints\n", + " - QA teams struggle to keep up with feature velocity\n", + " - Documentation for test planning is often rushed or skipped\n", + " - **Impact:** Technical debt builds up inside the test suite\n", + "\n", + "3. **🎲 Missed Ambiguities**\n", + " - Unclear requirements don't get questioned until implementation\n", + " - Assumptions creep in without validation\n", + " - Integration points and dependencies surface late\n", + " - **Impact:** Rework, missed deadlines, and scope creep\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### 🧩 Pattern Overview: AWS-Inspired Flow\n", + "\n", + "The AWS `generate-tests` command pattern distills test planning into four focused moves. Keeping each move small reduces cognitive load while preserving coverage rigor:\n", + "\n", + "1. **Command Summary** – clarify the mission, the success signals, and who consumes the plan.\n", + "2. **System Inputs** – bundle the minimal context (domain, requirements, existing tests) so the model scans once.\n", + "3. **Reasoning Checklist** – guide how the model should think before it writes any specs.\n", + "4. **Output Contract** – lock the deliverables into a predictable structure for downstream tooling.\n", + "\n", + "
    \n", + "AWS-Inspired Callouts

    \n", + "
      \n", + "
    • Keep the command summary to three bullets so the north star stays visible.
    • \n", + "
    • Group all inputs together to avoid jumping between sections.
    • \n", + "
    • Use a checklistβ€”short, ordered stepsβ€”to enforce deliberate reasoning.
    • \n", + "
    • Let the output contract mirror the sections your QA automation expects.
    • \n", + "
    \n", + "
    \n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### πŸ—‚οΈ Command Template Walkthrough\n", + "\n", + "Follow the four sections in order. Swap in your project variables and keep each bullet list tight so learners (and the model) stay oriented.\n", + "\n", + "**Where the tactics show up in the template:**\n", + "\n", + "| Template Block | What It Does | Tactic Used |\n", + "| --- | --- | --- |\n", + "| **``** | Frames intent, consumers, and success signals | Role prompting & guardrails |\n", + "| **``** with **``**, **``**, **``** | Bundles domain, requirements, and existing coverage for one-pass reading | Structured inputs |\n", + "| **``** | Forces deliberate analysis before drafting specs | Task decomposition + chain-of-thought |\n", + "| **``** | Locks deliverables into an automation-friendly format | Structured output |\n", + "\n", + "Use these four tags to tune the prompt for any feature: update `` with your mission and success signals, fill `` with the latest project context, sharpen `` questions to match risk areas, and rework `` sections so the output lands exactly where your QA tooling expects.\n", + "\n", + "```xml\n", + "\n", + "\n", + "Command: Generate a coverage-focused test plan before sprint planning.\n", + "Primary Objective: Expose untested scenarios so the QA team can prioritise automation.\n", + "Success Signals:\n", + "- Each critical flow has at least one unit or integration test candidate.\n", + "- Ambiguities and blockers are captured for follow-up.\n", + "- Output structure stays automation-ready for QA tooling.\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "{project_context}\n", + "\n", + "\n", + "{functional_requirements}\n", + "\n", + "\n", + "{existing_tests}\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "1. Summarise the product slice in two bullets to anchor context.\n", + "2. Compare requirements against existing tests and note risk themes.\n", + "3. Log ambiguities or missing business rules that block automation.\n", + "4. Expand uncovered scenarios into test specifications (unit or integration) with setup, steps, and expected results before writing the final output.\n", + "\n", + "\n", + "\n", + "\n", + "## Summary\n", + "- Product goal\n", + "- High-risk areas\n", + "\n", + "## Ambiguities & Follow-ups\n", + "- [Question]\n", + "- [Question]\n", + "\n", + "## Coverage Map\n", + "| Theme | Risk Level | Missing Scenario |\n", + "| --- | --- | --- |\n", + "\n", + "## Unit Tests\n", + "### Test: [Name]\n", + "**Goal:** [Purpose]\n", + "**Setup:** [Data, mocks]\n", + "**Steps:**\n", + "1. ...\n", + "**Expected:** ...\n", + "\n", + "## Integration Tests\n", + "### Test: [Name]\n", + "**Goal:** [Purpose]\n", + "**Setup:** [Services, data]\n", + "**Steps:**\n", + "1. ...\n", + "**Expected:** ...\n", + "\n", + "## Test Data & Tooling\n", + "- Fixtures, environments, monitoring hooks required.\n", + "\n", + "```\n", + "\n", + "This command template keeps the analysis, gap spotting, and deliverables in one tight flow. 
Learners can adapt it by swapping out the variables while the structure stays stable.\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### πŸ’» Working Example: Payment Service Test Generation\n", + "\n", + "Now let's watch the command-style template drive a full test planning session for a payment processing service.\n", + "\n", + "**What to look for:**\n", + "- Each section is labelled (``) so you can map it back to the walkthrough table.\n", + "- The reasoning checklist forces the model to analyse coverage gaps before it drafts specs.\n", + "- Unit and integration tests remain separated, and infrastructure needs stay visible at the end.\n", + "\n", + "Run the cell below to see the prompt and response rendered together.\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Example: Test Case Generation for Payment Service\n", + "\n", + "functional_requirements = \"\"\"\n", + "Payment Processing Requirements:\n", + "1. Process credit card payments with validation\n", + "2. Handle multiple currencies (USD, EUR, GBP)\n", + "3. Apply discounts and calculate tax\n", + "4. Generate transaction receipts\n", + "5. Handle payment failures and retries (max 3 attempts)\n", + "6. Send confirmation emails on success\n", + "7. Log all transactions for audit compliance\n", + "8. Support payment refunds within 30 days\n", + "\"\"\"\n", + "\n", + "existing_tests = \"\"\"\n", + "Current Test Suite (payment_service_test.py):\n", + "- test_process_valid_payment() - Happy path for USD payments\n", + "- test_invalid_card_number() - Validates card number format\n", + "- test_calculate_tax() - Tax calculation for US region only\n", + "\"\"\"\n", + "\n", + "project_context = \"\"\"\n", + "Domain: FinTech payments platform\n", + "Project: Payment Processing Service\n", + "Primary test framework: pytest\n", + "Tech stack: Python, FastAPI, PostgreSQL\n", + "\"\"\"\n", + "\n", + "command_prompt = f\"\"\"\n", + "\n", + "\n", + "Command: Generate a coverage-focused test plan for the payment processing service.\n", + "Primary Objective: Expose untested scenarios before sprint planning.\n", + "Success Signals:\n", + "- Every critical flow has at least one unit or integration test candidate.\n", + "- Ambiguities and open questions are captured for follow-up.\n", + "- Output structure stays automation-ready for QA tooling.\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "{project_context}\n", + "\n", + "\n", + "{functional_requirements}\n", + "\n", + "\n", + "{existing_tests}\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "1. Summarise the product slice in two bullets to anchor context.\n", + "2. Compare requirements against existing tests and note risk themes.\n", + "3. Log ambiguities or missing business rules that block automation.\n", + "4. Expand uncovered scenarios into test specifications (unit or integration) with setup, steps, and expected results before writing the final output.\n", + "\n", + "\n", + "\n", + "\n", + "## Summary\n", + "- Product goal\n", + "- High-risk areas\n", + "\n", + "## Ambiguities & Follow-ups\n", + "- [Question]\n", + "- [Question]\n", + "\n", + "## Coverage Map\n", + "| Theme | Risk Level | Missing Scenario |\n", + "| --- | --- | --- |\n", + "\n", + "## Unit Tests\n", + "### Test: [Name]\n", + "**Goal:** [Purpose]\n", + "**Setup:** [Data, mocks]\n", + "**Steps:**\n", + "1. 
...\n", + "**Expected:** ...\n", + "\n", + "## Integration Tests\n", + "### Test: [Name]\n", + "**Goal:** [Purpose]\n", + "**Setup:** [Services, data]\n", + "**Steps:**\n", + "1. ...\n", + "**Expected:** ...\n", + "\n", + "## Test Data & Tooling\n", + "- Fixtures, environments, monitoring hooks required.\n", + "\n", + "\"\"\"\n", + "\n", + "test_messages = [\n", + " {\n", + " \"role\": \"system\",\n", + " \"content\": \"You follow structured QA templates and produce detailed, automation-ready test plans.\"\n", + " },\n", + " {\n", + " \"role\": \"user\",\n", + " \"content\": command_prompt\n", + " }\n", + "]\n", + "\n", + "print(\"πŸ“‹ PROMPT PREVIEW\")\n", + "print(\"=\" * 70)\n", + "print(command_prompt)\n", + "print(\"=\" * 70)\n", + "print(\"πŸ§ͺ TEST GENERATION IN PROGRESS...\")\n", + "print(\"=\" * 70)\n", + "test_result = get_chat_completion(test_messages, temperature=0.0)\n", + "print(test_result)\n", + "print(\"=\" * 70)\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## πŸ‹οΈ Hands-On Practice: Activity 3.3\n", + "\n", + "
    \n", + "πŸ“ Activity Time: Work in Your Own File!

    \n", + "\n", + "Complete this activity in a dedicated markdown file for a clean, reusable workspace!\n", + "
    \n", + "\n", + "### 🎯 What You'll Build\n", + "\n", + "A production-ready test generation template for a shopping cart discount system with intentionally vague requirements.\n", + "\n", + "**Time Required:** 30-40 minutes\n", + "\n", + "**The Challenge:** These requirements are intentionally vague! Your template should identify ambiguities, generate edge cases, and produce comprehensive test specifications.\n", + "\n", + "### πŸ“ Instructions\n", + "\n", + "1. **Open the activity file:** [`activities/activity-3.3-test-generation.md`](./activities/activity-3.3-test-generation.md)\n", + "2. **Follow the 3-step process:**\n", + " - **Step 1 (10-15 min):** Research AWS test generation patterns\n", + " - **Step 2 (10-15 min):** Design your template (answer planning questions)\n", + " - **Step 3 (15-20 min):** Build your template between the markers\n", + "3. **Test your template** using the helper function below\n", + "4. **Compare with solution** when done: [`solutions/activity-3.3-test-generation-solution.md`](./solutions/activity-3.3-test-generation-solution.md)\n", + "\n", + "### πŸ§ͺ Testing Your Activity\n", + "\n", + "Use the helper function below to test your template directly from the activity file!\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Test your Activity 3.3 template\n", + "\n", + "# Shopping Cart Discount System requirements (intentionally vague!)\n", + "discount_requirements = \"\"\"\n", + "Feature: Shopping Cart Discount System\n", + "\n", + "Requirements:\n", + "1. Users can apply discount codes at checkout\n", + "2. Discount types: percentage (10%, 25%, etc.) or fixed amount ($5, $20, etc.)\n", + "3. Each discount code has an expiration date\n", + "4. Usage limits: one-time use OR unlimited\n", + "5. Business rule: Discounts cannot be combined (one per order)\n", + "6. Cart total must be > 0 after discount applied\n", + "7. Fixed discounts cannot exceed cart total\n", + "\"\"\"\n", + "\n", + "existing_tests = \"\"\"\n", + "Current test suite (minimal coverage):\n", + "- test_apply_percentage_discount() - 10% off $100 cart\n", + "- test_fixed_amount_discount() - $5 off $50 cart\n", + "\"\"\"\n", + "\n", + "# Run this to test your template from the activity file\n", + "test_activity_3_3(\n", + " test_code=discount_requirements,\n", + " variables={\n", + " 'domain': 'e-commerce',\n", + " 'project_name': 'Shopping Cart',\n", + " 'tech_stack': 'Python/Flask',\n", + " 'test_framework': 'pytest',\n", + " 'functional_requirements': discount_requirements,\n", + " 'test_suite_overview': existing_tests\n", + " }\n", + ")\n", + "\n", + "# The function will:\n", + "# 1. Read your template from activities/activity-3.3-test-generation.md\n", + "# 2. Substitute the variables\n", + "# 3. Send to the AI model\n", + "# 4. Display the results\n", + "# 5. 
Asks if you want to save results back to the activity file\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### πŸ“š Learn More: Advanced Test Generation Patterns\n", + "\n", + "Want to dive deeper into automated test generation?\n", + "\n", + "**πŸ“– AWS Anthropic Advanced Patterns:**\n", + "- [Test Generation Command Pattern](https://github.com/aws-samples/anthropic-on-aws/blob/main/advanced-claude-code-patterns/commands/generate-tests.md) - Production-ready patterns\n", + "\n", + "**πŸ”— Related Best Practices:**\n", + "- [Claude 4 Prompt Engineering](https://docs.claude.com/en/docs/build-with-claude/prompt-engineering/claude-4-best-practices)\n", + "- [Prompt Templates and Variables](https://docs.claude.com/en/docs/build-with-claude/prompt-engineering/prompt-templates-and-variables)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## βœ… Section 3 Complete!\n", + "\n", + "
    \n", + "πŸŽ‰ Outstanding work! You just wrapped up the Test Generation Automation section and completed Module 3.\n", + "
    \n", + "\n", + "**Key takeaways**\n", + "- Crafted a reusable template that analyzes requirements, finds coverage gaps, and plans infrastructure\n", + "- Practiced separating unit and integration specs while documenting assumptions\n", + "- Reinforced the Module 2 tactic stack inside a real SDLC workflow\n", + "\n", + "**Where to go next**\n", + "1. Revisit your activity files to iterate on the templates with real project requirements\n", + "2. Continue to Module 4 (if available) or integrate these prompts into your team's QA workflow\n", + "3. Share what you builtβ€”pair with teammates or incorporate into CI to keep momentum going\n", + "\n", + "
    \n", + " β˜• Time for a breather?\n", + " Celebrate the milestone, then come back when you're ready to apply these prompts in production.\n", + "
    \n" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": ".venv", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.13.2" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/01-course/module-03-applications/activities/README.md b/01-course/module-03-applications/activities/README.md new file mode 100644 index 0000000..a854478 --- /dev/null +++ b/01-course/module-03-applications/activities/README.md @@ -0,0 +1,241 @@ +# Module 3: Practice Activities + +Complete these hands-on activities to master SDLC prompt engineering. + +--- + +## πŸ“š Activity Overview + +| Activity | Topic | Time | Difficulty | Prerequisites | +|----------|-------|------|------------|---------------| +| [3.2](./activity-3.2-code-review.md) | Build Code Review Template | 30-40 min | ⭐⭐⭐ | `3.2-code-review-automation.ipynb` complete | +| [3.3](./activity-3.3-test-generation.md) | Build Test Generation Template | 30-40 min | ⭐⭐⭐ | `3.3-test-generation-automation.ipynb` complete | + +--- + +## 🎯 How to Complete Activities + +### Step 1: Open the Activity File +Each activity is a separate markdown file. Open it in your editor alongside the notebook. + +### Step 2: Follow the 3-Step Process +Each activity uses the same proven workflow: + +1. **RESEARCH** (10-15 min) - Study real-world patterns from AWS +2. **DESIGN** (10-15 min) - Answer planning questions before coding +3. **BUILD** (15-20 min) - Create your template between the markers + +### Step 3: Edit the Template +Work directly in the activity file between these markers: +```markdown + +[Your template goes here] + +``` + +### Step 4: Test Your Template +Use the helper functions in the notebooks to test your templates: + +```python +# In 3.2-code-review-automation.ipynb +test_activity_3_2(test_code="...", variables={...}) + +# In 3.3-test-generation-automation.ipynb +test_activity_3_3(test_code="...", variables={...}) +``` + +**Results auto-save back to your activity file!** ✨ + +### Step 5: Compare with Solution +Check your work against the official solution in `solutions/` + +**Pro Tip:** Solutions are also testable! Run them to see expected output: + +```python +from setup_utils import test_activity + +# Test the solution to see what "good" looks like +test_activity( + 'solutions/activity-3.2-code-review-solution.md', + test_code="...", + variables={...} +) +``` + +--- + +## πŸ’‘ Tips for Success + +### Before You Start +- βœ… Complete the corresponding notebook section first +- βœ… Read the activity instructions carefully +- βœ… Study the AWS patterns linked in each activity + +### While Working +- πŸ”„ **Iterate quickly** - Test early, test often +- πŸ“ **Take notes** - Capture insights in the Learning Notes section +- 🎯 **Focus on understanding** - Don't just copy patterns, understand why they work +- πŸ€” **Think about edge cases** - What could go wrong? + +### Testing Tips +- Start with the provided test cases +- Try multiple test scenarios +- Check if results meet all success criteria +- Compare output quality with examples in notebooks + +### If You Get Stuck +1. Review the notebook examples +2. Study the AWS pattern more carefully +3. Check the "Common Issues" section in the activity +4. Compare with the solution (but try yourself first!) 
+ +--- + +## πŸ”§ Using the Testing Functions + +### Quick Testing +```python +# Simplest usage +test_activity_3_2() # Uses defaults +``` + +### With Custom Test Code +```python +# Test with your own code +test_code = """ +def vulnerable_function(): + # your code here +""" + +test_activity_3_2(test_code=test_code) +``` + +### With Template Variables +```python +# Customize template variables +test_activity_3_2( + test_code=my_code, + variables={ + 'tech_stack': 'Python/Django', + 'repo_name': 'my-project', + 'service_name': 'api-service' + } +) +``` + +### Test Solutions for Reference +```python +# Test the solution to see expected output +from setup_utils import test_activity + +# Activity 3.2 solution +test_activity( + 'solutions/activity-3.2-code-review-solution.md', + test_code=vulnerable_code, + variables={'tech_stack': 'Python', 'repo_name': 'auth-service'} +) + +# Activity 3.3 solution +test_activity( + 'solutions/activity-3.3-test-generation-solution.md', + test_code=requirements, + variables={'domain': 'E-commerce', 'tech_stack': 'Python/pytest'} +) +``` + +**Why test solutions?** +- See what "good" output looks like +- Compare your results to expert templates +- Learn from production-ready examples + +--- + +## πŸ“Š Track Your Progress + +- [ ] **Setup** (`3.1-setup-and-introduction.ipynb`) + - [ ] Install dependencies + - [ ] Test AI connection + - [ ] Understand module structure + +- [ ] **Activity 3.2: Code Review Template** (30-40 min) + - [ ] Complete: `3.2-code-review-automation.ipynb` + - [ ] Research: AWS code review pattern + - [ ] Design: Answer planning questions in `activity-3.2-code-review.md` + - [ ] Build: Create template between `` markers + - [ ] Test: Run `test_activity_3_2()` with authentication code + - [ ] Compare: Test solution file to see expected output + - [ ] Iterate: Improve based on comparison + +- [ ] **Activity 3.3: Test Generation Template** (30-40 min) + - [ ] Complete: `3.3-test-generation-automation.ipynb` + - [ ] Research: AWS test generation pattern + - [ ] Design: Answer planning questions in `activity-3.3-test-generation.md` + - [ ] Build: Create template between markers + - [ ] Test: Run `test_activity_3_3()` with discount system + - [ ] Compare: Test solution file to see expected output + - [ ] Iterate: Improve based on comparison + +- [ ] **Optional: LLM-as-Judge** (`3.4-llm-as-judge-evaluation.ipynb`) + - [ ] Learn evaluation patterns + - [ ] Understand rubric design + - [ ] See quality validation examples + +**🎊 Completed all activities?** You've mastered SDLC prompt engineering fundamentals! 
+ +--- + +## πŸŽ“ What You'll Learn + +### Activity 3.2: Code Review +- βœ… How to structure comprehensive review prompts +- βœ… Defining clear severity levels and categories +- βœ… Reviewing across multiple dimensions (security, performance, quality) +- βœ… Generating actionable feedback with code examples +- βœ… Combining multiple tactics (role, context, chain-of-thought) +- βœ… Real-world pattern application from AWS + +### Activity 3.3: Test Generation +- βœ… How to identify ambiguities in vague requirements +- βœ… Generating comprehensive edge cases systematically +- βœ… Separating unit tests from integration tests +- βœ… Creating complete test specifications +- βœ… Recommending test infrastructure needs + +--- + +## πŸ”— Quick Links + +- [Notebook 1: Setup & Introduction](../3.1-setup-and-introduction.ipynb) +- [Notebook 2: Code Review Automation](../3.2-code-review-automation.ipynb) +- [Notebook 3: Test Generation Automation](../3.3-test-generation-automation.ipynb) +- [Notebook 4: LLM-as-Judge Evaluation](../3.4-llm-as-judge-evaluation.ipynb) +- [View Solutions](../solutions/) (now testable!) +- [Module 3 README](../README.md) + +--- + +## 🀝 Getting Help + +**Questions?** +- Check the "Common Issues" table in each activity +- Review the notebook examples +- Study the AWS patterns more carefully +- Compare with solutions (after attempting yourself!) + +**Found a bug or have suggestions?** +- Open an issue in the repository +- Share feedback with your instructor + +--- + +## πŸ“ˆ Success Metrics + +You'll know you've mastered these skills when you can: + +1. βœ… Identify which prompt tactics to combine for different SDLC tasks +2. βœ… Design templates that produce consistent, high-quality outputs +3. βœ… Adapt templates to your specific project context +4. βœ… Test and iterate on prompts systematically +5. βœ… Explain design decisions and trade-offs + +**Ready to start?** Open [Activity 3.2](./activity-3.2-code-review.md) and begin! πŸš€ diff --git a/01-course/module-03-applications/activities/activity-3.1-code-review.md b/01-course/module-03-applications/activities/activity-3.2-code-review.md similarity index 96% rename from 01-course/module-03-applications/activities/activity-3.1-code-review.md rename to 01-course/module-03-applications/activities/activity-3.2-code-review.md index a05016a..1b6f643 100644 --- a/01-course/module-03-applications/activities/activity-3.1-code-review.md +++ b/01-course/module-03-applications/activities/activity-3.2-code-review.md @@ -1,4 +1,4 @@ -# Activity 3.1: Build Your Own Code Review Template +# Activity 3.2: Build Your Own Code Review Template **⏱️ Time Required:** 30-40 minutes **🎯 Difficulty:** Intermediate @@ -69,7 +69,7 @@ Any reusable {{variables}} you want: 1. Scroll to the template block below and edit only the content between `` and ``. 2. Replace placeholder text with your own role, guidelines, tasks, and output format. 3. Stick with the XML shell shown, or switch the code fence (e.g., to ````markdown) and rewrite it in structured Markdownβ€”the tester will capture everything between the markers either way. -4. Save the file, then open `3.2-code-review-automation.ipynb` and run `test_activity_3_1()` to check your work. +4. Save the file, then open `3.2-code-review-automation.ipynb` and run `test_activity_3_2()` to check your work. **Helpful reminders** - Leave the HTML comments (``) in place so the tester can find your template. @@ -80,7 +80,7 @@ Any reusable {{variables}} you want:
    ❓ Why do I need those HTML comment markers? (Click to expand) -The `` and `` markers tell the `test_activity_3_1()` function where your template begins and ends. They're invisible when markdown is rendered but essential for the auto-testing feature! +The `` and `` markers tell the `test_activity_3_2()` function where your template begins and ends. They're invisible when markdown is rendered but essential for the auto-testing feature!
    @@ -144,7 +144,7 @@ Purpose: {{change_purpose}} * * Next step: Test it! * Go to: 3.2-code-review-automation.ipynb - * Run: test_activity_3_1(test_code="...", variables={...}) + * Run: test_activity_3_2(test_code="...", variables={...}) ******************************************************************************/ ```` @@ -194,7 +194,7 @@ test_code = """ + return {"status": "ok", "session": SESSION_CACHE[username], "permissions": permissions} """ - test_activity_3_1( + test_activity_3_2( test_code=test_code, variables={ 'tech_stack': 'Python', @@ -239,7 +239,7 @@ Before considering this activity complete, verify: ### Compare with Solution Once you're satisfied with your template, compare it with the official solution: -πŸ“– [`solutions/activity-3.1-code-review-solution.md`](../solutions/activity-3.1-code-review-solution.md) +πŸ“– [`solutions/activity-3.2-code-review-solution.md`](../solutions/activity-3.2-code-review-solution.md) ### Keep Iterating - Save a copy of your finished template in your repo (for example, `prompts/code-review-template.xml`) so you can reuse and improve it. diff --git a/01-course/module-03-applications/activities/activity-3.3-test-generation.md b/01-course/module-03-applications/activities/activity-3.3-test-generation.md new file mode 100644 index 0000000..46e988a --- /dev/null +++ b/01-course/module-03-applications/activities/activity-3.3-test-generation.md @@ -0,0 +1,300 @@ +# Activity 3.3: Build Your Own Test Generation Template + +**⏱️ Time Required:** 30-40 minutes +**🎯 Difficulty:** Intermediate +**πŸ“š Prerequisites:** Complete Section 2 of `3.3-test-generation-automation.ipynb` + +--- + +## 🎯 Your Mission + +Build a production-ready **command-style** test generation prompt template that analyzes requirements, spots ambiguities, and produces reusable specs for an e-commerce discount system. You will research, design, and assemble the four building blocksβ€”``, ``, ``, and ``β€”so the model delivers consistent, automation-friendly test plans. + +--- + +## πŸ“‹ Success Criteria + +Your finished template should: +- βœ… Capture intent, success signals, and consumers inside `` +- βœ… Bundle all context (project overview, functional requirements, existing tests) inside `` +- βœ… Force deliberate thinking about coverage gaps via `` +- βœ… Emit a predictable test specification inside `` +- βœ… Separate unit vs integration tests and call out infrastructure needs + +--- + +## πŸ” Scenario Snapshot + +You are supporting a shopping cart discount system with intentionally vague requirements: +- Users can apply discount codes (percentage or fixed amount) +- Codes have expiration dates and usage limits +- Cart total must stay positive after discounts +- Business rule: only one discount per order + +Your template must highlight unclear requirements, invent edge cases, and document the tests your QA team needs. + +--- + +## πŸ“ Working Plan + +### Step 1 β€” Research the Pattern (10-15 minutes) +1. Read the [AWS Anthropic Test Generation Command](https://github.com/aws-samples/anthropic-on-aws/blob/main/advanced-claude-code-patterns/commands/generate-tests.md). +2. Note how they phrase the command summary, organize inputs, guide reasoning, and structure the output. 
+ +Use the space below for quick notes: +``` +Command summary ingredients: +- + +Inputs to carry over: +- + +Reasoning checklist prompts I like: +- + +Output elements worth keeping: +- +``` + +### Step 2 β€” Blueprint Your Command Template (10-15 minutes) +Answer these prompts before editing the template: +``` +Command summary focus (mission, consumers, success signals): +- + +System inputs to include (project context, requirements, coverage snapshot): +- + +Reasoning checklist questions that surface gaps: +1. +2. +3. +4. + +Output contract sections and formatting rules: +- + +Reusable {{variables}} I plan to support: +- +``` +> Tip: Tune the four tags to match the test case you are targeting. Updating the command summary, inputs, checklist, and output contract is the fastest way to adapt the template to a new feature. + +### Step 3 β€” Build & Test (15-20 minutes) +1. Scroll to the editable template block below. +2. Modify only the content between `` and ``. +3. Replace the TODO comments with your command summary, system inputs, reasoning checklist, and output contract. +4. Keep the XML shell, `{{variables}}`, and HTML comments so the testing helper can extract your work. +5. Save the file and run `test_activity_3_3()` in the notebook when you are ready. + +βœ… **Do:** update the four tags to optimise the prompt for the discount system (or any feature you target). +❌ **Don’t:** remove the XML tags, change the code fence, or delete the comment markersβ€”those are required for automated testing. + +
    +❓ Why keep the HTML comment markers? + +The `` and `` markers tell the `test_activity_3_3()` helper exactly where your template begins and ends. They disappear in rendered Markdown but are essential for automation. + +
    + +--- + +## πŸ‘‡ YOUR EDITABLE TEMPLATE IS BELOW πŸ‘‡ + +```xml +/******************************************************************************* + * ✏️ EDIT YOUR TEMPLATE BETWEEN THE COMMENT BLOCKS + * + * The test function extracts everything between: + * and + * + * Instructions: + * 1. Replace TODO comments with your content + * 2. Focus on coverage gaps, ambiguity detection, and reusable output + * 3. Keep the four command sections intact + * 4. Use {{variables}} for parameterisation + ******************************************************************************/ + + + + +Command: TODO - Replace with when and why the team runs this command. +Primary Objective: TODO - Define the goal for coverage or decision-making. +Success Signals: +- TODO - What confirms coverage is sufficient? +- TODO - What ambiguities must be surfaced? +- TODO - How should output stay automation-ready? + + + + + + +{{project_context}} + + + +{{functional_requirements}} + + + +{{existing_tests}} + + + + + +1. TODO - Anchor context in one or two bullets. +2. TODO - Compare requirements to existing tests and log risk themes. +3. TODO - Capture ambiguities or missing business rules that block automation. +4. TODO - Expand uncovered scenarios into test specs (unit/integration) with setup, steps, and expected results before drafting the final answer. + + + + +## Summary +- [Product goal] +- [High-risk areas or notable gaps] + +## Ambiguities & Follow-ups +- **Question:** [Open question] + **Why it matters:** [Impact if unanswered] + **Assumption (if unclarified):** [Temporary assumption] + +## Coverage Map +| Theme | Risk Level | Missing Scenario | +| --- | --- | --- | + +## Unit Tests +### Test: [Name] +**Goal:** [Purpose] +**Setup:** [Data, mocks] +**Steps:** +1. ... +2. ... +**Expected:** [...] +**Priority:** [P0/P1/P2] + +## Integration Tests +### Test: [Name] +**Goal:** [Purpose] +**Setup:** [Services, data] +**Steps:** +1. ... +2. ... +**Expected:** [...] +**Priority:** [P0/P1/P2] + +## Test Data & Tooling +- Mocks/Stubs: [...] +- Fixtures: [...] +- Environments: [...] + + + +/******************************************************************************* + * YOUR TEMPLATE ENDS HERE + * + * Next step: Test it! + * Go to: 3.3-test-generation-automation.ipynb + * Run: test_activity_3_3(test_code="...", variables={...}) + ******************************************************************************/ +``` + +--- + +### Step 4 β€” Test Your Template + +**πŸ§ͺ Try it in the notebook:** + +Open `3.3-test-generation-automation.ipynb` and run: + +```python +# Shopping Cart Discount System requirements (intentionally vague!) +discount_requirements = """ +Feature: Shopping Cart Discount System + +Requirements: +1. Users can apply discount codes at checkout +2. Discount types: percentage (10%, 25%, etc.) or fixed amount ($5, $20, etc.) +3. Each discount code has an expiration date +4. Usage limits: one-time use OR unlimited +5. Business rule: Discounts cannot be combined (one per order) +6. Cart total must be > 0 after discount applied +7. 
Fixed discounts cannot exceed cart total +""" + +existing_tests = """ +Current test suite (minimal coverage): +- test_apply_percentage_discount() - 10% off $100 cart +- test_fixed_amount_discount() - $5 off $50 cart +""" + +project_context = """ +Domain: E-commerce platform +Project: Shopping Cart Discount Service +Primary test framework: pytest +Tech stack: Python/Flask +""" + +test_activity_3_3( + test_code=discount_requirements, + variables={ + 'project_context': project_context, + 'functional_requirements': discount_requirements, + 'existing_tests': existing_tests + } +) +``` + +**Your template's output:** + +``` +[Results will be automatically saved here when you test] +``` + + +**Self-Check:** +- [ ] Did `` make the mission and success signals explicit? +- [ ] Did `` capture everything the model needs in one scan? +- [ ] Did `` surface ambiguities and risk themes? +- [ ] Did `` separate unit vs integration tests with full specs? +- [ ] Did you call out infrastructure needs and priorities? + +--- + +## πŸ§ͺ Extra Experiments + +After your first run, iterate: +- Swap in a new `{{project_context}}` for a different domain (e.g., SaaS billing). +- Feed altered requirements (e.g., stackable discounts) and confirm ambiguities change. +- Trim success signals to see how the output shifts, then refine them. + +--- + +## βœ… Completion Checklist + +- [ ] Template highlights intent, inputs, reasoning, and deliverables clearly +- [ ] Ambiguities and coverage gaps are called out before test specs +- [ ] Unit and integration tests are separated with reusable formatting +- [ ] Test infrastructure needs are listed (mocks, data, environments) +- [ ] Template uses `{{variables}}` for easy reuse +- [ ] Tested with the helper function and reviewed the output + +--- + +## πŸš€ Next Steps + +- Compare with the reference solution: [`solutions/activity-3.3-test-generation-solution.md`](../solutions/activity-3.3-test-generation-solution.md) +- Reflect on what you learned: +``` +1. +2. +3. +``` +- Ready for more? Continue to `3.4-llm-as-judge-evaluation.ipynb` for model-based evaluation patterns. + +--- + +## πŸ’‘ Need Help? + +Drop a note in your team's channel or revisit the AWS pattern for inspiration. Small tweaks to the four tags go a long wayβ€”focus on clarity over volume. diff --git a/01-course/module-03-applications/setup_utils.py b/01-course/module-03-applications/setup_utils.py index 175b901..19cff6f 100644 --- a/01-course/module-03-applications/setup_utils.py +++ b/01-course/module-03-applications/setup_utils.py @@ -244,13 +244,13 @@ def test_activity(activity_file, test_code=None, variables=None, auto_save=True) Test your activity template directly from the .md file. IMPORTANT: Complete your activity template BEFORE running this function! 
- - Open the activity file (e.g., 'activities/activity-3.1-code-review.md') + - Open the activity file (e.g., 'activities/activity-3.2-code-review.md') - Replace all comments with your actual content - Fill in role, guidelines, tasks, and output format sections - Save the file, then run this test function Args: - activity_file: Path to your activity file (e.g., 'activities/activity-3.1-code-review.md') + activity_file: Path to your activity file (e.g., 'activities/activity-3.2-code-review.md') test_code: Optional code sample to review (uses example from file if not provided) variables: Optional dict of template variables (e.g., {'tech_stack': 'Python', 'repo_name': 'my-app'}) auto_save: If True, prompts to save result back to activity file @@ -428,46 +428,62 @@ def list_activities(): print() print("="*70) - print("πŸ’‘ Usage: test_activity('activities/activity-3.1-code-review.md')") + print("πŸ’‘ Usage: test_activity('activities/activity-3.2-code-review.md')") # Quick access functions for each activity -def test_activity_3_1(test_code=None, variables=None): +def test_activity_3_2(test_code=None, variables=None): """ - Quick helper for Activity 3.1: Comprehensive Code Review + Quick helper for Activity 3.2: Comprehensive Code Review IMPORTANT: Complete your template in the activity file BEFORE running this! """ - return test_activity('activities/activity-3.1-code-review.md', test_code=test_code, variables=variables) + return test_activity('activities/activity-3.2-code-review.md', test_code=test_code, variables=variables) -def test_activity_3_1_solution(test_code=None, variables=None): +def test_activity_3_2_solution(test_code=None, variables=None): """ - Test the provided solution for Activity 3.1: Comprehensive Code Review + Test the provided solution for Activity 3.2: Comprehensive Code Review Use this to see how the solution template works before building your own. Note: auto_save is disabled for solution files to keep them as clean references. """ - return test_activity('solutions/activity-3.1-code-review-solution.md', test_code=test_code, variables=variables, auto_save=False) + return test_activity('solutions/activity-3.2-code-review-solution.md', test_code=test_code, variables=variables, auto_save=False) -def test_activity_3_2(test_code=None, variables=None): +def test_activity_3_3(test_code=None, variables=None): """ - Quick helper for Activity 3.2: Test Generation + Quick helper for Activity 3.3: Test Generation IMPORTANT: Complete your template in the activity file BEFORE running this! """ - return test_activity('activities/activity-3.2-test-generation.md', test_code=test_code, variables=variables) + return test_activity('activities/activity-3.3-test-generation.md', test_code=test_code, variables=variables) -def test_activity_3_2_solution(test_code=None, variables=None): +def test_activity_3_3_solution(test_code=None, variables=None): """ - Test the provided solution for Activity 3.2: Test Generation + Test the provided solution for Activity 3.3: Test Generation Use this to see how the solution template works before building your own. Note: auto_save is disabled for solution files to keep them as clean references. 
""" - return test_activity('solutions/activity-3.2-test-generation-solution.md', test_code=test_code, variables=variables, auto_save=False) + return test_activity('solutions/activity-3.3-test-generation-solution.md', test_code=test_code, variables=variables, auto_save=False) + + +def test_activity_3_1(test_code=None, variables=None): + """ + Backwards-compatible helper for the former Activity 3.1 (now Activity 3.2). + """ + print("⚠️ Activity 3.1 has been renumbered to Activity 3.2. Routing to test_activity_3_2().") + return test_activity_3_2(test_code=test_code, variables=variables) + + +def test_activity_3_1_solution(test_code=None, variables=None): + """ + Backwards-compatible helper for the former Activity 3.1 solution (now Activity 3.2). + """ + print("⚠️ Activity 3.1 solution has been renumbered to Activity 3.2. Routing to test_activity_3_2_solution().") + return test_activity_3_2_solution(test_code=test_code, variables=variables) # ============================================ diff --git a/01-course/module-03-applications/solutions/README.md b/01-course/module-03-applications/solutions/README.md new file mode 100644 index 0000000..85ad860 --- /dev/null +++ b/01-course/module-03-applications/solutions/README.md @@ -0,0 +1,102 @@ +# Module 3 Solutions + +This directory contains detailed solution analyses for the practice activities in Module 3. + +## πŸ“ Solution Files + +### [Activity 3.2: Comprehensive Code Review](activity-3.2-code-review-solution.md) +- Multi-dimensional review template (security, performance, quality) +- Balanced approach inspired by AWS patterns +- Expected findings across all dimensions +- CI/CD pipeline integration examples +- Customization for different tech stacks and contexts + +### [Activity 3.3: Test Generation Sprint](activity-3.3-test-generation-solution.md) +- Ambiguity detection in requirements +- Comprehensive edge case identification +- Unit vs integration test separation +- Sprint planning integration +- Continuous improvement feedback loops + +### [Activity 3.3: Template Customization Challenge](activity-3.3-customization-solution.md) +- Performance review with N+1 query analysis +- Complexity analysis (Big-O notation) +- Adaptation patterns for SRE, API design, React +- When to create domain-specific templates +- Step-by-step customization strategy + +### [Activity 3.4: Quality Evaluation with LLM-as-Judge](activity-3.4-judge-solution.md) +- Complete judge evaluation breakdown +- Production quality gate implementation +- Automated retry logic with feedback +- Monitoring dashboard and metrics +- Success criteria for AI-assisted workflows + +## 🎯 How to Use These Solutions + +1. **Try the activity first** - Complete the exercise in the notebook without looking at solutions +2. **Run your code** - Execute your template and review the AI output +3. **Compare results** - Check the solution to see what best practices you may have missed +4. **Iterate** - Refine your template based on the solution analysis +5. **Customize** - Adapt the patterns for your specific use case + +## πŸ“Š What Makes a Solution "Best Practice"? + +Each solution demonstrates these key principles: + +### 1. Domain-Specific Expertise +- βœ… Role matches the task (Security Engineer, Performance Engineer, QA Lead) +- βœ… Guidelines use domain terminology (OWASP, Big-O, CWE) +- βœ… Output format appropriate for the domain + +### 2. 
Clear, Measurable Criteria +- βœ… Specific guidelines (not vague "check for issues") +- βœ… Quantified expectations (complexity, severity, coverage) +- βœ… Evidence requirements (line numbers, code snippets, references) + +### 3. Actionable Output +- βœ… Developers can immediately act on recommendations +- βœ… Includes specific code changes or steps +- βœ… Explains WHY, not just WHAT + +### 4. Production-Ready +- βœ… Error handling and fallback strategies +- βœ… Integration with CI/CD workflows +- βœ… Monitoring and metrics for continuous improvement +- βœ… Quality gates with retry logic + +### 5. Scalable and Reusable +- βœ… Parameterized templates ({{placeholders}}) +- βœ… Documented for team use +- βœ… Version controlled with changelogs +- βœ… Adaptable to different contexts + +## πŸ’‘ Beyond the Exercises + +These solutions provide patterns you can apply to other SDLC tasks: + +- **Root Cause Analysis** - Adapt the decomposition + CoT pattern +- **Documentation Review** - Modify the judge rubric for clarity/completeness +- **Architecture Review** - Customize role to "Principal Architect" +- **Incident Post-Mortems** - Use structured output format for consistency + +## πŸš€ Next Steps + +After reviewing solutions: + +1. **Implement in your project** - Start with one template for your team +2. **Customize for your domain** - Adapt roles, guidelines, output format +3. **Measure effectiveness** - Track detection rate, false positives, time saved +4. **Iterate based on feedback** - Use LLM-as-Judge to measure quality over time +5. **Build a template library** - Version control and share with your team + +## πŸ“š Additional Resources + +- **Main Notebook**: [module3.ipynb](../module3.ipynb) +- **Module README**: [README.md](../README.md) +- **Production Template Library**: See cell 39 in the notebook for copy-paste ready code + +--- + +**Questions or improvements?** Open an issue in the repository or submit a PR with your own solution variations! + diff --git a/01-course/module-03-applications/solutions/activity-3.1-code-review-solution.md b/01-course/module-03-applications/solutions/activity-3.2-code-review-solution.md similarity index 98% rename from 01-course/module-03-applications/solutions/activity-3.1-code-review-solution.md rename to 01-course/module-03-applications/solutions/activity-3.2-code-review-solution.md index 0c1f636..230ce01 100644 --- a/01-course/module-03-applications/solutions/activity-3.1-code-review-solution.md +++ b/01-course/module-03-applications/solutions/activity-3.2-code-review-solution.md @@ -1,8 +1,8 @@ -# Activity 3.1 Solution: Comprehensive Code Review Template +# Activity 3.2 Solution: Comprehensive Code Review Template **⏱️ Completion Time:** Reference solution **🎯 Focus:** Multi-dimensional code review (security, performance, quality, best practices) -**πŸ“š Ready to Test:** Use `test_activity_3_1_solution()` with this file path +**πŸ“š Ready to Test:** Use `test_activity_3_2_solution()` with this file path --- @@ -18,7 +18,7 @@ from setup_utils import test_activity # Test the solution template test_activity( - 'solutions/activity-3.1-code-review-solution.md', + 'solutions/activity-3.2-code-review-solution.md', test_code = """ + import hashlib + import time @@ -67,7 +67,7 @@ test_activity( ```xml /******************************************************************************* - * SOLUTION TEMPLATE FOR ACTIVITY 3.1 + * SOLUTION TEMPLATE FOR ACTIVITY 3.2 * * This is a complete, production-ready comprehensive code review template. 
* Focus areas: diff --git a/01-course/module-03-applications/solutions/activity-3.3-test-generation-solution.md b/01-course/module-03-applications/solutions/activity-3.3-test-generation-solution.md new file mode 100644 index 0000000..67089dd --- /dev/null +++ b/01-course/module-03-applications/solutions/activity-3.3-test-generation-solution.md @@ -0,0 +1,176 @@ +# Activity 3.3 Solution: Test Generation for E-Commerce + +**⏱️ Completion Time:** Reference solution +**🎯 Focus:** Command-style coverage planning, ambiguity detection, reusable specs +**πŸ“š Ready to Test:** Use `test_activity_3_3_solution()` with this file path + +--- + +## 🎯 Complete Working Solution + +This reference solution delivers a production-ready test generation command composed of four sections: ``, ``, ``, and ``. It follows the AWS pattern while tailoring success signals, inputs, and output structure for the shopping cart discount system. + +### How to Test This Solution + +```python +# In 3.3-test-generation-automation.ipynb +from setup_utils import test_activity + +discount_requirements = """ +Feature: Shopping Cart Discount System + +Requirements: +1. Users can apply discount codes at checkout +2. Discount types: percentage (10%, 25%, etc.) or fixed amount ($5, $20, etc.) +3. Each discount code has an expiration date +4. Usage limits: one-time use OR unlimited +5. Business rule: Discounts cannot be combined (one per order) +6. Cart total must be > 0 after discount applied +7. Fixed discounts cannot exceed cart total +""" + +existing_tests = """ +Current test suite (minimal coverage): +- test_apply_percentage_discount() - 10% off $100 cart +- test_fixed_amount_discount() - $5 off $50 cart +""" + +project_context = """ +Domain: E-commerce platform +Project: Shopping Cart Discount Service +Primary test framework: pytest +Tech stack: Python/Flask +""" + +test_activity( + 'solutions/activity-3.3-test-generation-solution.md', + test_code=discount_requirements, + variables={ + 'project_context': project_context, + 'functional_requirements': discount_requirements, + 'existing_tests': existing_tests + } +) +``` + +--- + +## πŸ‘‡ COMPLETE WORKING TEMPLATE BELOW πŸ‘‡ + +```xml +/******************************************************************************* + * SOLUTION TEMPLATE FOR ACTIVITY 3.3 + * + * This command-style template: + * - Sets intent and success signals in + * - Packages all context in + * - Forces deliberate analysis via + * - Produces automation-ready specs in + ******************************************************************************/ + + + +Command: Generate a coverage-first test plan for the Shopping Cart Discount Service before sprint planning. +Primary Objective: Expose untested scenarios, ambiguities, and infrastructure needs so QA can prioritise automation work. +Success Signals: +- Every critical flow (happy path, edge case, error path) has at least one unit or integration test candidate with priority. +- Ambiguities and policy questions are captured with temporary assumptions and follow-up owners. +- Output remains markdown with summary, coverage map, separated test specs, and infrastructure checklist to drop into QA tooling. + + + + +{{project_context}} + + +{{functional_requirements}} + + +{{existing_tests}} + + + + +1. Summarise the product slice in two bullets (business goal + technical scope) to anchor the test strategy. +2. Compare each requirement against the existing tests and label risk themes (happy path, boundary, error handling, policy, security). +3. 
Record ambiguities, unstated rules, or data dependencies that could block automation; assign provisional assumptions if clarification is pending. +4. For every uncovered scenario, outline a test specification (unit or integration) including setup, steps, expected result, and priority before drafting the final markdown output. +5. Note any infrastructure, data, or tooling gaps (mocks, fixtures, clock control) needed to implement the plan. + + + +## Summary +- Product goal: [One-sentence mission] +- High-risk areas: [Top 2-3 risk themes surfaced] + +## Ambiguities & Follow-ups +- **Question:** [What's unclear?] + **Why it matters:** [Potential impact] + **Owner / Next step:** [Who clarifies] + **Assumption (until resolved):** [Working assumption] + +## Coverage Map +| Theme | Risk Level | Missing Scenario | Notes | +| --- | --- | --- | --- | + +## Unit Tests +### Test: [Descriptive name] +**Goal:** [Behaviour or rule validated] +**Setup:** [Data, fixtures, mocks] +**Steps:** +1. ... +2. ... +**Expected:** [...] +**Priority:** [P0/P1/P2] +**Why it matters:** [Business/tech impact] + +[Repeat subsection for each unit test] + +## Integration Tests +### Test: [Descriptive name] +**Goal:** [Workflow validated] +**Setup:** [Services, data, sequencing] +**Steps:** +1. ... +2. ... +**Expected:** [...] +**Priority:** [P0/P1/P2] +**Why it matters:** [Business/tech impact] + +[Repeat subsection for each integration test] + +## Test Data & Tooling +- Mocks/Stubs: [Payment gateway, clock, notification service, etc.] +- Fixtures: [Discount codes by status, carts at boundary totals, user segments] +- Environments: [Local, staging, feature flags] +- Observability: [Logs/metrics needed for validation] + +## Implementation Roadmap +- P0 (Critical): [Tests to ship first] +- P1 (High): [Next wave] +- P2 (Medium): [Follow-up scenarios] + +## Success Checklist +- [ ] P0 coverage implemented and passing +- [ ] Ambiguities resolved or tracked +- [ ] Test data & tooling available in CI +- [ ] Regression criteria documented + + + +/******************************************************************************* + * SOLUTION TEMPLATE ENDS HERE + ******************************************************************************/ +``` + +--- + +## βœ… Why This Solution Works + +- **Clear Intent:** `` states when to use the template, what success looks like, and who consumes the output. +- **Single-Pass Inputs:** `` groups project overview, requirements, and existing tests so the model never loses context. +- **Deliberate Reasoning:** `` pushes the model to analyse before writing specs, mirroring risk reviews in real teams. +- **Automation-Ready Output:** `` mirrors the markdown structure used in the notebooks and CI tooling (summary, ambiguities, coverage map, test specs, infrastructure, roadmap). +- **Prioritisation:** Including a roadmap and success checklist helps teams plan implementation effort, not just enumerate tests. + +Use this file to benchmark your own template, then iterate to match your team's domain or tooling needs. 
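    As an aside on the `{{variables}}` used throughout this solution (`{{project_context}}`, `{{functional_requirements}}`, `{{existing_tests}}`): rendering a parameterised template is plain string substitution. The sketch below is illustrative only; the notebooks do this for you via `test_activity_3_3()`, and `render_template` is not a real helper in `setup_utils.py`.

    ```python
    # Illustrative sketch of {{variable}} substitution, assuming simple string placeholders.
    # The course utilities handle this internally; this only shows the idea.
    def render_template(template, variables):
        """Replace each {{name}} placeholder with its value; unknown placeholders stay as-is."""
        rendered = template
        for name, value in variables.items():
            rendered = rendered.replace("{{" + name + "}}", value)
        return rendered

    # Example usage with the variables this solution expects
    rendered_prompt = render_template(
        "Project: {{project_context}} Requirements: {{functional_requirements}}",
        {"project_context": "E-commerce discount service",
         "functional_requirements": "Only one discount per order"},
    )
    print(rendered_prompt)
    ```
    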
From f57e23e8ddcfe55560bfd3474635bb161ec06caf Mon Sep 17 00:00:00 2001 From: snehangshuk Date: Fri, 17 Oct 2025 18:14:30 +0530 Subject: [PATCH 5/6] Complete Module 3 with three sections and solutions for all activities --- .../3.4-llm-as-judge-evaluation.ipynb | 522 +++ .../module-03-applications/module3.ipynb | 3550 +++++++++++++++++ .../activity-3.3-customization-solution.md | 354 ++ .../solutions/activity-3.4-judge-solution.md | 416 ++ session_1_introduction_and_basics.ipynb | 1048 +++++ 5 files changed, 5890 insertions(+) create mode 100644 01-course/module-03-applications/3.4-llm-as-judge-evaluation.ipynb create mode 100644 01-course/module-03-applications/module3.ipynb create mode 100644 01-course/module-03-applications/solutions/activity-3.3-customization-solution.md create mode 100644 01-course/module-03-applications/solutions/activity-3.4-judge-solution.md create mode 100644 session_1_introduction_and_basics.ipynb diff --git a/01-course/module-03-applications/3.4-llm-as-judge-evaluation.ipynb b/01-course/module-03-applications/3.4-llm-as-judge-evaluation.ipynb new file mode 100644 index 0000000..3d93056 --- /dev/null +++ b/01-course/module-03-applications/3.4-llm-as-judge-evaluation.ipynb @@ -0,0 +1,522 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Section 3.4: Evaluate Your Prompt Templates with LLM-as-Judge\n", + "\n", + "| **Aspect** | **Details** |\n", + "|-------------|-------------|\n", + "| **Goal** | Add an evaluation layer that scores outputs from your prompt templates before they reach production |\n", + "| **Time** | ~25 minutes |\n", + "| **Prerequisites** | Sections 3.1–3.3 complete, `setup_utils.py` loaded |\n", + "| **What You'll Strengthen** | Trustworthy automation, rubric design, quality gates |\n", + "| **Next Steps** | Return to the [Module 3 overview](./README.md) or wire scores into your workflow |\n", + "\n", + "---\n", + "\n", + "You just built reusable prompt templates in Sections 3.2 and 3.3. Now you'll learn how to **evaluate those AI outputs** with an LLM-as-Judge so you can accept great responses, request revisions, or escalate risky ones." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## πŸ”§ Quick Setup Check\n", + "\n", + "Since you completed Section 1, setup is already done! We just need to import it." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Quick setup check - imports setup_utils\n", + "try:\n", + " import importlib\n", + " import setup_utils\n", + " importlib.reload(setup_utils)\n", + " from setup_utils import *\n", + " print(f\"βœ… Setup loaded! Using {PROVIDER.upper()} with {get_default_model()}\")\n", + " print(\"πŸš€ Ready to score AI outputs with an LLM judge!\")\n", + "except ImportError:\n", + " print(\"❌ Setup not found!\")\n", + " print(\"πŸ’‘ Please run 3.1-setup-and-introduction.ipynb first to set up your environment.\")\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## βš–οΈ LLM-as-Judge Evaluation Template\n", + "\n", + "### Building the Evaluation Loop for Your Prompt Templates\n", + "\n", + "
    \n", + "🎯 What You'll Build in This Section

    \n", + "\n", + "You'll create an **LLM-as-Judge rubric** that reviews the output produced by your prompt templates. The judge scores the response, explains its verdict, and tells you whether to accept it, request a revision, or fall back to a human reviewer.\n", + "

    \n", + "Time Required: ~25 minutes (learn, see the example, then try it on your own outputs)\n", + "
    \n", + "\n", + "Layering a judge after your templates keeps quality high without sending everything back to humans. In Session 1 we saw that traditional metrics (F1, BLEU, ROUGE) miss hallucinations and manual reviews are too slow to scale. A rubric-driven LLM judge gives you semantic understanding *and* consistent scoring.\n", + "\n", + "#### 🎯 The Problem We're Solving\n", + "\n", + "1. **🚨 Silent Failures**\n", + " - Template-generated outputs can look polished while hiding factual or security mistakes.\n", + " - Legacy metrics can't flag these issues because they only check surface-level overlap.\n", + "\n", + "2. **⏳ Manual QA Bottlenecks**\n", + " - Human spot checks take days and don't scale to thousands of AI responses.\n", + " - Feedback arrives too late to keep CI/CD pipelines moving.\n", + "\n", + "3. **🎯 Inconsistent Standards**\n", + " - Without a codified rubric, every reviewer (human or AI) applies different criteria.\n", + " - Teams struggle to know when to ship, regenerate, or escalate." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### πŸ—οΈ How We'll Build It: The Tactical Combination\n", + "\n", + "We chain together Module 2 tactics plus what you learned about judges in Session 1.\n", + "\n", + "| **Tactic** | **Purpose in This Template** | **Why Modern LLMs Need This** |\n", + "|------------|------------------------------|-------------------------------|\n", + "| **Role Prompting** | Positions the judge as a principal engineer with review authority | Anchors the evaluation in expert expectations instead of generic chat replies |\n", + "| **Structured Inputs** | Separates context, rubric, and submission using XML-style tags | Prevents the model from blending instructions with the artifact under review |\n", + "| **Rubric Decomposition** | Breaks quality into weighted criteria | Mirrors Session 1 guidance: multi-dimensional scoring avoids naive pass/fail |\n", + "| **Chain-of-Thought Justification** | Forces rationale before the decision | Produces auditable feedback and catches hallucinations sooner |\n", + "| **Decision Thresholds** | Maps weighted score to Accept / Revise / Reject actions | Gives your pipeline a clear automation hook instead of reading prose |\n", + "\n", + "
    \n", + "Reminder from Session 1

    \n", + "Relying on a single yes/no question (for example, 'Is this output correct?') lets hidden errors slip through. Weighted rubrics with explicit thresholds give you measurable guardrails.\n", + "
    " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### πŸ€” Why Add a Judge After Prompt Templates?\n", + "\n", + "- **Detect hidden regressions:** LLM judges evaluate meaning, so paraphrased but wrong answers score poorly even when lexical metrics look fine.\n", + "- **Keep automation trustworthy:** A second AI call verifies that template outputs meet the same criteria every time, reducing escalation load.\n", + "- **Accelerate iteration:** Scores highlight which tactic block to tweak, letting you A/B test prompts without waiting for human reviewers." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### πŸ“‹ LLM-as-Judge Rubric Template\n", + "\n", + "```xml\n", + "\n", + "You are a Principal Engineer reviewing AI-generated code feedback.\n", + "\n", + "\n", + "\n", + "1. Accuracy (40%): Do identified issues actually exist and are correctly described?\n", + "2. Completeness (30%): Are major concerns covered? Any critical issues missed?\n", + "3. Actionability (20%): Are recommendations specific and implementable?\n", + "4. Communication (10%): Is the review professional, clear, and well-structured?\n", + "\n", + "\n", + "\n", + "Score each criterion 1-5 with detailed rationale:\n", + "- 5: Excellent - Exceeds expectations\n", + "- 4: Good - Meets expectations with minor gaps\n", + "- 3: Acceptable - Meets minimum bar\n", + "- 2: Needs work - Significant gaps\n", + "- 1: Unacceptable - Fails to meet standards\n", + "\n", + "Calculate weighted total: (AccuracyΓ—0.4) + (CompletenessΓ—0.3) + (ActionabilityΓ—0.2) + (CommunicationΓ—0.1)\n", + "\n", + "Recommend:\n", + "- ACCEPT (β‰₯3.5): Production-ready\n", + "- REVISE (2.5-3.4): Needs improvements, provide specific guidance\n", + "- REJECT (<2.5): Start over with different approach\n", + "\n", + "\n", + "\n", + "{{llm_output_under_review}}\n", + "\n", + "\n", + "\n", + "Provide structured evaluation with:\n", + "- Individual scores (1-5) with rationale for each criterion\n", + "- Weighted total score\n", + "- Recommendation (ACCEPT/REVISE/REJECT)\n", + "- Specific feedback for improvements\n", + "\n", + "```\n", + "\n", + "#### πŸ”‘ Rubric Design Principles\n", + "\n", + "1. **Weighted Criteria** – Prioritise what matters most (accuracy first for safety-critical domains).\n", + "2. **Explicit Scale** – Clear definitions stop the judge from drifting between runs.\n", + "3. **Evidence-Based Rationale** – Forces the model to ground scores in the submission.\n", + "4. **Actionable Thresholds** – Numeric gates let pipelines auto-approve or request revisions.\n", + "5. **Improvement Guidance** – \"Revise\" outcomes must include next steps for the generator.\n", + "\n", + "#### πŸ§ͺ Calibration Framework\n", + "\n", + "The rubric above tells the judge **what** to score; calibration makes sure everyone scores it the **same way**. Treat calibration notes as the companion playbook that keeps your accuracy/completeness/actionability/communication scores aligned across reviewers and over time.\n", + "\n", + "Instead of generic \"7/10 - pretty good\" language, define what each score means. 
For example, **7/10 = factually accurate with minor gaps, clear structure, appropriate for the target audience, but missing one or two implementation details.**\n", + "\n", + "#### πŸ› οΈ Use-Case Calibration Examples\n", + "\n", + "Tie calibration back to your weighted criteria: the examples below show how different score levels reflect accuracy, completeness, actionability, and communication in a documentation context.\n", + "\n", + "| Scenario | 9/10 | 5/10 | 2/10 |\n", + "|----------|------|------|------|\n", + "| Technical documentation | Complete, tested, and handles edge cases | Covers main flows, some gaps in error handling | Only basic concepts, missing implementation details |\n", + "\n", + "#### πŸ“ Calibration Best Practices\n", + "\n", + "- **Anchor scores:** Use real examples for every score level so the judge can compare and map them back to the rubric criteria.\n", + "- **Regular recalibration:** Review rubrics quarterly with domain experts and adjust thresholds or weights as standards evolve.\n", + "- **Inter-rater reliability:** Have multiple calibrators score the same samples to confirm they interpret the rubric the same way.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "### πŸ’» Working Example: Judge the Section 3.2 Code Review\n", + "\n", + "This cell replays the Section 3.2 template to generate the comprehensive AI review, then immediately scores it with the judge using the same monthly report diff.\n", + "\n", + "**What you'll see:**\n", + "- The full AI review that the template produces\n", + "- How the rubric weights accuracy, completeness, actionability, and communication\n", + "- An Accept/Revise/Reject recommendation tied to the numeric thresholds\n", + "\n", + "
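    To make the scoring arithmetic concrete, here is a small sketch of the weighted total and the Accept/Revise/Reject cut-offs exactly as the rubric defines them (criterion scores on the 1-5 scale). Only the weights and thresholds come from the rubric above; the function names are illustrative.

    ```python
    # Sketch of the rubric arithmetic: weighted total on the 1-5 scale plus decision thresholds.
    WEIGHTS = {"accuracy": 0.4, "completeness": 0.3, "actionability": 0.2, "communication": 0.1}

    def weighted_total(scores):
        """Weighted sum of the four criterion scores (each 1-5)."""
        return round(sum(scores[criterion] * weight for criterion, weight in WEIGHTS.items()), 2)

    def recommendation(total):
        """Map the weighted total to the rubric's decision thresholds."""
        if total >= 3.5:
            return "ACCEPT"
        if total >= 2.5:
            return "REVISE"
        return "REJECT"

    scores = {"accuracy": 4, "completeness": 3, "actionability": 4, "communication": 5}
    total = weighted_total(scores)          # (4*0.4) + (3*0.3) + (4*0.2) + (5*0.1) = 3.8
    print(total, recommendation(total))     # 3.8 ACCEPT
    ```
    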
    \n", + "⚠️ Heads-up:

    \n", + "The next cell first replays the Section 3.2 prompt template to regenerate the AI review, then runs the LLM-as-Judge rubric on that fresh output.\n", + "
    \n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Example: Judge the Section 3.2 code review output\n", + "\n", + "code_diff = '''\n", + "+ import json\n", + "+ import time\n", + "+ from decimal import Decimal\n", + "+\n", + "+ CACHE = {}\n", + "+\n", + "+ def generate_monthly_report(org_id, db, s3_client):\n", + "+ if org_id in CACHE:\n", + "+ return CACHE[org_id]\n", + "+\n", + "+ query = f\"SELECT * FROM invoices WHERE org_id = '{org_id}' ORDER BY created_at DESC\"\n", + "+ rows = db.execute(query)\n", + "+\n", + "+ total = Decimal(0)\n", + "+ items = []\n", + "+ for row in rows:\n", + "+ total += Decimal(row['amount'])\n", + "+ items.append({\n", + "+ 'id': row['id'],\n", + "+ 'customer': row['customer_name'],\n", + "+ 'amount': float(row['amount'])\n", + "+ })\n", + "+\n", + "+ payload = {\n", + "+ 'org': org_id,\n", + "+ 'generated_at': time.strftime('%Y-%m-%d %H:%M:%S'),\n", + "+ 'total': float(total),\n", + "+ 'items': items\n", + "+ }\n", + "+\n", + "+ key = f\"reports/{org_id}/{int(time.time())}.json\"\n", + "+ time.sleep(0.5)\n", + "+ s3_client.put_object(\n", + "+ Bucket='company-reports',\n", + "+ Key=key,\n", + "+ Body=json.dumps(payload),\n", + "+ ACL='public-read'\n", + "+ )\n", + "+\n", + "+ CACHE[org_id] = key\n", + "+ return key\n", + "'''\n", + "\n", + "review_messages = [\n", + " {\n", + " \"role\": \"system\",\n", + " \"content\": \"You follow structured review templates and produce clear, actionable findings.\"\n", + " },\n", + " {\n", + " \"role\": \"user\",\n", + " \"content\": f\"\"\"\n", + "\n", + "\n", + "Act as a Senior Software Engineer specializing in Python backend services.\n", + "Your expertise covers security best practices, performance tuning, reliability, and maintainable design.\n", + "\n", + "\n", + "\n", + "\n", + "Repository: analytics-platform\n", + "Service: Reporting API\n", + "Purpose: Add a monthly invoice report exporter that finance can trigger\n", + "Change Scope: Review focuses on the generate_monthly_report implementation\n", + "Language: python\n", + "\n", + "\n", + "\n", + "\n", + "{code_diff}\n", + "\n", + "\n", + "\n", + "\n", + "Assess the change across multiple dimensions:\n", + "\n", + "1. Security β€” SQL injection, S3 object exposure, sensitive data handling.\n", + "2. Performance β€” query efficiency, blocking calls, caching behaviour.\n", + "3. Error Handling β€” resilience to empty results, network/storage failures.\n", + "4. Code Quality β€” readability, global state, data conversions.\n", + "5. Correctness β€” totals, currency precision, repeated report generation.\n", + "6. 
Best Practices β€” configuration management, separation of concerns, testing hooks.\n", + "For each finding, cite the diff line, describe impact, and share an actionable fix.\n", + "\n", + "\n", + "\n", + "\n", + "Step 1 - Think: Analyse the diff using the dimensions listed above.\n", + "Step 2 - Assess: For each issue, capture Severity (CRITICAL/MAJOR/MINOR/INFO), Category, Line, Issue, Impact.\n", + "Step 3 - Suggest: Provide a concrete remediation (code change or process tweak).\n", + "Step 4 - Verdict: Summarise overall risk and recommend APPROVE / REQUEST CHANGES / NEEDS WORK.\n", + "\n", + "\n", + "\n", + "\n", + "## Code Review Summary\n", + "[One paragraph on overall health and primary risks]\n", + "\n", + "## Findings\n", + "### [SEVERITY] Issue Title\n", + "**Category:** [Security / Performance / Quality / Correctness / Best Practices]\n", + "**Line:** [line number]\n", + "**Issue:** [impact-focused description]\n", + "**Recommendation:**\n", + "```\n", + "# safer / faster / cleaner fix here\n", + "```\n", + "\n", + "## Overall Assessment\n", + "**Recommendation:** [APPROVE | REQUEST CHANGES | NEEDS WORK]\n", + "**Summary:** [What to address before merge]\n", + "\n", + "\"\"\"\n", + " },\n", + "]\n", + "\n", + "print(\"πŸ” Generating the Section 3.2 code review...\")\n", + "print(\"=\" * 70)\n", + "ai_generated_review = get_chat_completion(review_messages, temperature=0.0)\n", + "print(ai_generated_review)\n", + "print(\"=\" * 70)\n", + "\n", + "rubric_prompt = \"\"\"\n", + "\n", + "Original pull request diff:\n", + "{context}\n", + "\n", + "AI-generated review to evaluate:\n", + "{ai_output}\n", + "\n", + "\n", + "\n", + "1. Accuracy (40%): Do identified issues actually exist and are correctly described?\n", + "2. Completeness (30%): Are major concerns covered? Any critical issues missed?\n", + "3. Actionability (20%): Are recommendations specific and implementable?\n", + "4. Communication (10%): Is the review professional, clear, and well-structured?\n", + "\n", + "\n", + "\n", + "Score each criterion 1-5 with detailed rationale.\n", + "Calculate weighted total: (AccuracyΓ—0.4) + (CompletenessΓ—0.3) + (ActionabilityΓ—0.2) + (CommunicationΓ—0.1)\n", + "\n", + "Recommend:\n", + "- ACCEPT (β‰₯3.5): Production-ready\n", + "- REVISE (2.5-3.4): Needs improvements \n", + "- REJECT (<2.5): Unacceptable quality\n", + "\n", + "\n", + "Provide structured evaluation with scores, weighted total, recommendation, and feedback.\n", + "\"\"\"\n", + "\n", + "judge_messages = [\n", + " {\"role\": \"system\", \"content\": \"You are a Principal Engineer reviewing AI-generated code feedback.\"},\n", + " {\"role\": \"user\", \"content\": rubric_prompt.format(context=code_diff, ai_output=ai_generated_review)}\n", + "]\n", + "\n", + "print(\"βš–οΈ JUDGE EVALUATION IN PROGRESS...\")\n", + "print(\"=\" * 70)\n", + "judge_result = get_chat_completion(judge_messages, temperature=0.0)\n", + "print(judge_result)\n", + "print(\"=\" * 70)\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## πŸ‹οΈ Hands-On Practice: Evaluate Your Templates\n", + "\n", + "Use the judge to score the outputs you generated in Activities 3.1 and 3.2." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "
    \n", + "⚠️ IMPORTANT: Capture the AI output you want to judge before running the cell below.\n", + "

    \n", + "Steps to complete first:\n", + "
      \n", + "
    • Run test_activity_3_2(...) or test_activity_3_3(...) to generate the AI response from your template.
    • \n", + "
    • Save the original artifact (code snippet, ticket, or plan) that the AI evaluated.
    • \n", + "
    • Copy the AI response into the ai_output_under_review placeholder in the next cell.
    • \n", + "
    • Optional: Store the judge score alongside your activity file for future comparisons.
    • \n", + "
    \n", + "
    \n", + "\n", + "
    \n", + "πŸ’‘ Tip: Run the judge after every major template change. Tracking the scores over time makes regressions obvious and keeps your automation trustworthy.\n", + "
    " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Score your own AI output with the judge\n", + "\n", + "artifact_context = \"\"\"\n", + "# Paste the original artifact here (code snippet, ticket, requirement, etc.)\n", + "\"\"\"\n", + "\n", + "ai_output_under_review = \"\"\"\n", + "# Paste the AI-generated review or plan you want to evaluate\n", + "\"\"\"\n", + "\n", + "rubric_prompt = \"\"\"\n", + "\n", + "Original artifact:\n", + "{context}\n", + "\n", + "AI-generated output to evaluate:\n", + "{ai_output}\n", + "\n", + "\n", + "\n", + "1. Accuracy (40%): Do identified issues actually exist and are correctly described?\n", + "2. Completeness (30%): Are major concerns covered? Any critical issues missed?\n", + "3. Actionability (20%): Are recommendations specific and implementable?\n", + "4. Communication (10%): Is the review professional, clear, and well-structured?\n", + "\n", + "\n", + "\n", + "Score each criterion 1-5 with detailed rationale.\n", + "Calculate weighted total: (AccuracyΓ—0.4) + (CompletenessΓ—0.3) + (ActionabilityΓ—0.2) + (CommunicationΓ—0.1)\n", + "\n", + "Recommend:\n", + "- ACCEPT (β‰₯3.5): Production-ready\n", + "- REVISE (2.5-3.4): Needs improvements \n", + "- REJECT (<2.5): Unacceptable quality\n", + "\n", + "\n", + "Provide structured evaluation with scores, weighted total, recommendation, and feedback.\n", + "\"\"\"\n", + "\n", + "def run_judge_evaluation(context, ai_output, temp=0.0):\n", + " messages = [\n", + " {\"role\": \"system\", \"content\": \"You are a Principal Engineer reviewing AI-generated code feedback.\"},\n", + " {\"role\": \"user\", \"content\": rubric_prompt.format(context=context, ai_output=ai_output)}\n", + " ]\n", + " print(\"βš–οΈ JUDGE EVALUATION IN PROGRESS...\")\n", + " print(\"=\" * 70)\n", + " result = get_chat_completion(messages, temperature=temp)\n", + " print(result)\n", + " print(\"=\" * 70)\n", + " return result\n", + "\n", + "if artifact_context.strip() and ai_output_under_review.strip():\n", + " run_judge_evaluation(artifact_context, ai_output_under_review)\n", + "else:\n", + " print(\"βœ‹ Add your artifact and AI output above before running the judge.\")\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### πŸ“š Learn More: Production-Ready Evaluation Patterns\n", + "\n", + "- [Anthropic: Claude Prompting & Evaluation](https://docs.claude.com/en/docs/build-with-claude/prompt-engineering/evaluating-outputs) β€” Advanced rubric techniques and bias checks.\n", + "- [OpenAI Cookbook: Model Grading Patterns](https://cookbook.openai.com/examples/evals/model-graded-eval) β€” How to structure model-graded evaluations and plug them into CI.\n", + "- [Weights & Biases Evaluations Guide](https://docs.wandb.ai/guides/llm-evaluations) β€” Capture judge scores alongside offline experiments.\n", + "- [Session 1 Recap](../../session_1_introduction_and_basics.ipynb) β€” Revisit why automated metrics alone miss hallucinations." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## βœ… Section 4 Complete!\n", + "\n", + "
    \n", + "πŸŽ‰ Nice work! You just added an evaluation layer to your prompt engineering workflow.\n", + "
    \n", + "\n", + "**Key takeaways**\n", + "- Layer a rubric-driven judge after every major template to catch silent failures.\n", + "- Use weighted criteria and explicit thresholds so automation can act on the score.\n", + "- Archive judge outputs to track drift and prove quality to stakeholders.\n", + "\n", + "**Next up**\n", + "1. Run the judge on your Activity 3.2 and 3.2 outputs.\n", + "2. Feed low-scoring responses back into your template for iteration.\n", + "3. Integrate the judge call into your CI/CD or agent workflow.\n", + "\n", + "
    \n", + " β˜• Time for a quick reset?\n", + " Stretch, hydrate, and come back ready to automate the hand-off to production.\n", + "
    " + ] + } + ], + "metadata": { + "language_info": { + "name": "python" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/01-course/module-03-applications/module3.ipynb b/01-course/module-03-applications/module3.ipynb new file mode 100644 index 0000000..ffc7e18 --- /dev/null +++ b/01-course/module-03-applications/module3.ipynb @@ -0,0 +1,3550 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Module 3 - Apply Advanced Prompting Engineering Tactics to SDLC\n", + "\n", + "| **Aspect** | **Details** |\n", + "|-------------|-------------|\n", + "| **Goal** | Blend previously mastered strategiesβ€”task decomposition, role prompting, chain-of-thought reasoning, LLM-as-Judge critique, and structured formattingβ€”to design reliable prompts for code review and software development lifecycle (SDLC) activities |\n", + "| **Time** | ~120-150 minutes (2-2.5 hours) |\n", + "| **Prerequisites** | Module 2 completion, Python 3.8+, IDE with notebook support, API access (GitHub Copilot, CircuIT, or OpenAI) |\n", + "| **Setup Required** | Clone the repository and follow [Quick Setup](../../README.md#-quick-setup) before running this notebook |\n", + "\n", + "---\n", + "\n", + "## πŸš€ Ready to Start?\n", + "\n", + "
    \n", + "⚠️ Important:

    \n", + "This module builds directly on Module 2 techniques. Make sure you've completed Module 2 before starting.
    \n", + "
    \n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## πŸ”§ Setup: Environment Configuration\n", + "\n", + "### Step 1: Install Required Dependencies\n", + "\n", + "Let's start by installing the packages we need for this tutorial.\n", + "\n", + "Run the cell below. You should see a success message when installation completes:\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Install required packages for Module 3\n", + "import subprocess\n", + "import sys\n", + "\n", + "def install_requirements():\n", + " try:\n", + " # Install from requirements.txt\n", + " subprocess.check_call([sys.executable, \"-m\", \"pip\", \"install\", \"-q\", \"-r\", \"requirements.txt\"])\n", + " print(\"βœ… SUCCESS! Module 3 dependencies installed successfully.\")\n", + " print(\"πŸ“¦ Ready for: openai, anthropic, python-dotenv, requests\")\n", + " except subprocess.CalledProcessError as e:\n", + " print(f\"❌ Installation failed: {e}\")\n", + " print(\"πŸ’‘ Try running: pip install openai anthropic python-dotenv requests\")\n", + "\n", + "install_requirements()\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Step 2: Connect to AI Model\n", + "\n", + "
    \n", + "πŸ’‘ Note:

    \n", + "The code below runs on your local machine and connects to AI services over the internet.\n", + "
    \n", + "\n", + "Choose your preferred option:\n", + "\n", + "- **Option A: GitHub Copilot API (local proxy)** ⭐ **Recommended**: \n", + " - Supports both **Claude** and **OpenAI** models\n", + " - No API keys needed - uses your GitHub Copilot subscription\n", + " - Follow [GitHub-Copilot-2-API/README.md](../../GitHub-Copilot-2-API/README.md) to authenticate and start the local server\n", + " - Run the setup cell below and **edit your preferred provider** (`\"openai\"` or `\"claude\"`) by setting the `PROVIDER` variable\n", + " - Available models:\n", + " - **OpenAI**: gpt-4o, gpt-4, gpt-3.5-turbo, o3-mini, o4-mini\n", + " - **Claude**: claude-3.5-sonnet, claude-3.7-sonnet, claude-sonnet-4\n", + "\n", + "- **Option B: OpenAI API**: If you have OpenAI API access, uncomment and run the **Option B** cell below.\n", + "\n", + "- **Option C: CircuIT APIs (Azure OpenAI)**: If you have CircuIT API access, uncomment and run the **Option C** cell below.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Option A: GitHub Copilot API setup (Recommended)\n", + "import openai\n", + "import anthropic\n", + "import os\n", + "\n", + "# ============================================\n", + "# 🎯 CHOOSE YOUR AI MODEL PROVIDER\n", + "# ============================================\n", + "# Set your preference: \"openai\" or \"claude\"\n", + "PROVIDER = \"claude\" # Change to \"claude\" to use Claude models\n", + "\n", + "# ============================================\n", + "# πŸ“‹ Available Models by Provider\n", + "# ============================================\n", + "# OpenAI Models (via GitHub Copilot):\n", + "# - gpt-4o (recommended, supports vision)\n", + "# - gpt-4\n", + "# - gpt-3.5-turbo\n", + "# - o3-mini, o4-mini\n", + "#\n", + "# Claude Models (via GitHub Copilot):\n", + "# - claude-3.5-sonnet (recommended, supports vision)\n", + "# - claude-3.7-sonnet (supports vision)\n", + "# - claude-sonnet-4 (supports vision)\n", + "# ============================================\n", + "\n", + "# Configure clients for both providers\n", + "openai_client = openai.OpenAI(\n", + " base_url=\"http://localhost:7711/v1\",\n", + " api_key=\"dummy-key\"\n", + ")\n", + "\n", + "claude_client = anthropic.Anthropic(\n", + " api_key=\"dummy-key\",\n", + " base_url=\"http://localhost:7711\"\n", + ")\n", + "\n", + "# Set default models for each provider\n", + "OPENAI_DEFAULT_MODEL = \"gpt-5\"\n", + "CLAUDE_DEFAULT_MODEL = \"claude-sonnet-4\"\n", + "\n", + "\n", + "def _extract_text_from_blocks(blocks):\n", + " \"\"\"Extract text content from response blocks returned by the API.\"\"\"\n", + " parts = []\n", + " for block in blocks:\n", + " text_val = getattr(block, \"text\", None)\n", + " if isinstance(text_val, str):\n", + " parts.append(text_val)\n", + " elif isinstance(block, dict):\n", + " t = block.get(\"text\")\n", + " if isinstance(t, str):\n", + " parts.append(t)\n", + " return \"\\n\".join(parts)\n", + "\n", + "\n", + "def get_openai_completion(messages, model=None, temperature=0.0):\n", + " \"\"\"Get completion from OpenAI models via GitHub Copilot.\"\"\"\n", + " if model is None:\n", + " model = OPENAI_DEFAULT_MODEL\n", + " try:\n", + " response = openai_client.chat.completions.create(\n", + " model=model,\n", + " messages=messages,\n", + " temperature=temperature\n", + " )\n", + " return response.choices[0].message.content\n", + " except Exception as e:\n", + " return f\"❌ Error: {e}\\nπŸ’‘ Make sure GitHub Copilot proxy is running on port 
7711\"\n", + "\n", + "\n", + "def get_claude_completion(messages, model=None, temperature=0.0):\n", + " \"\"\"Get completion from Claude models via GitHub Copilot.\"\"\"\n", + " if model is None:\n", + " model = CLAUDE_DEFAULT_MODEL\n", + " try:\n", + " response = claude_client.messages.create(\n", + " model=model,\n", + " max_tokens=8192,\n", + " messages=messages,\n", + " temperature=temperature\n", + " )\n", + " return _extract_text_from_blocks(getattr(response, \"content\", []))\n", + " except Exception as e:\n", + " return f\"❌ Error: {e}\\nπŸ’‘ Make sure GitHub Copilot proxy is running on port 7711\"\n", + "\n", + "\n", + "def get_chat_completion(messages, model=None, temperature=0.0):\n", + " \"\"\"\n", + " Generic function to get chat completion from any provider.\n", + " Routes to the appropriate provider-specific function based on PROVIDER setting.\n", + " \"\"\"\n", + " if PROVIDER.lower() == \"claude\":\n", + " return get_claude_completion(messages, model, temperature)\n", + " else: # Default to OpenAI\n", + " return get_openai_completion(messages, model, temperature)\n", + "\n", + "\n", + "def get_default_model():\n", + " \"\"\"Get the default model for the current provider.\"\"\"\n", + " if PROVIDER.lower() == \"claude\":\n", + " return CLAUDE_DEFAULT_MODEL\n", + " else:\n", + " return OPENAI_DEFAULT_MODEL\n", + "\n", + "\n", + "# ============================================\n", + "# πŸ§ͺ TEST CONNECTION\n", + "# ============================================\n", + "print(\"πŸ”„ Testing connection to GitHub Copilot proxy...\")\n", + "test_result = get_chat_completion([\n", + " {\"role\": \"user\", \"content\": \"Say 'Connection successful!' if you can read this.\"}\n", + "])\n", + "\n", + "if test_result and (\"successful\" in test_result.lower() or \"success\" in test_result.lower()):\n", + " print(f\"βœ… Connection successful! Using {PROVIDER.upper()} provider with model: {get_default_model()}\")\n", + " print(f\"πŸ“ Response: {test_result}\")\n", + "else:\n", + " print(\"⚠️ Connection test completed but response unexpected:\")\n", + " print(f\"πŸ“ Response: {test_result}\")\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 🎯 Applying Prompt Engineering to SDLC Tasks\n", + "\n", + "---\n", + "\n", + "### Introduction: From Tactics to Real-World Applications\n", + "\n", + "#### πŸš€ Ready to Transform Your Development Workflow?\n", + "\n", + "You've successfully mastered the core tactics in Module 2. Now comes the exciting part - **applying these techniques to real-world software engineering challenges** that you face every day.\n", + "\n", + "Think of what you've accomplished so far as **learning individual martial arts moves**. Now we're going to **choreograph them into powerful combinations** that solve actual development problems.\n", + "\n", + "\n", + "#### πŸ‘¨β€πŸ’» What You're About to Master\n", + "\n", + "In the next sections, you'll discover **how to combine tactics strategically** to build production-ready prompts for critical SDLC tasks:\n", + "\n", + "
    \n", + "\n", + "
    \n", + "πŸ” Code Review Automation
    \n", + "Comprehensive review prompts with structured feedback\n", + "
    \n", + "\n", + "
    \n", + "πŸ§ͺ Test Generation & QA
    \n", + "Smart test plans with coverage gap analysis\n", + "
    \n", + "\n", + "
    \n", + "βš–οΈ Quality Validation
    \n", + "LLM-as-Judge rubrics for output verification\n", + "
    \n", + "\n", + "
    \n", + "πŸ“‹ Reusable Templates
    \n", + "Parameterized prompts for CI/CD integration\n", + "
    \n", + "\n", + "
    \n", + "\n", + "
    \n", + "πŸ’‘ Pro Tip:

    \n", + "This module covers practical applications over 120-150 minutes. Take short breaks between sections to reflect on how each template applies to your projects. Make notes as you progressβ€”jot down specific use cases from your codebase. The key skill is learning which tactic combinations solve which problems!\n", + "
    \n", + "\n", + "---\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### πŸ“ How to Use Break Points\n", + "\n", + "
    \n", + "πŸ’‘ Taking Breaks? We've Got You Covered!

    \n", + "\n", + "This module is designed for 120-150 minutes of focused learning. To help you manage your time effectively, we've added **4 strategic break points** throughout:\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
    Break PointLocationTime ElapsedBookmark Text
    β˜• Break #1After Section 1~40 min\"Section 2: Test Case Generation Template\"
    🍡 Break #2After Section 2~75 min\"Section 3: LLM-as-Judge Evaluation Rubric\"
    πŸ§ƒ Break #3After Section 3~105 min\"Hands-On Practice Activities\"
    🎯 Break #4After Practice Activities~145 min\"Section 4: Template Best Practices\"
    \n", + "\n", + "**How to Resume Your Session:**\n", + "1. Scroll down to find the colorful break point card you last saw\n", + "2. Look for the **\"πŸ“Œ BOOKMARK TO RESUME\"** section\n", + "3. Use `Ctrl+F` (or `Cmd+F` on Mac) to search for the bookmark text\n", + "4. You'll jump right to where you left off!\n", + "\n", + "**Pro Tip:** Each break point card shows:\n", + "- βœ… What you've completed\n", + "- ⏭️ What's coming next\n", + "- ⏱️ Estimated time for the next section\n", + "\n", + "Feel free to work at your own paceβ€”these are suggestions, not requirements! πŸš€\n", + "
    \n", + "\n", + "---\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### 🎨 Technique Spotlight: Strategic Combinations\n", + "\n", + "Here's how Module 2 tactics combine to solve real SDLC challenges:\n", + "\n", + "| **Technique** | **Purpose in SDLC Context** | **Prompting Tip** |\n", + "|---------------|----------------------------|-------------------|\n", + "| **Task Decomposition** | Break multifaceted engineering tasks (e.g., review + test suggestions) into manageable parts | Structure prompt into numbered steps or XML blocks (e.g., ``, ``) |\n", + "| **Role Prompting** | Align the model's persona with engineering expectations (e.g., \"Senior Backend Engineer\") | Specify domain, experience level, and evaluation criteria |\n", + "| **Chain-of-Thought** | Ensure reasoning is visible, aiding traceability and auditing | Request structured reasoning before conclusions, optionally hidden using \"inner monologue\" tags |\n", + "| **LLM-as-Judge** | Evaluate code changes or generated artifacts against standards | Provide rubric with weighted criteria and evidence requirement |\n", + "| **Few-Shot Examples** | Instill preferred review tone, severity labels, or test formats | Include short exemplars with both input (``, ``) and expected reasoning |\n", + "| **Prompt Templates** | Reduce prompt drift across teams and tools | Parameterize sections (`{{code_diff}}`, `{{requirements}}`) for consistent reuse |\n", + "\n", + "#### πŸ”— The Power of Strategic Combinations\n", + "\n", + "The real skill isn't using tactics in isolationβ€”it's knowing **which combinations solve which problems**. Each section demonstrates a different combination pattern optimized for specific SDLC challenges.\n", + "\n", + "Ready to build production-ready solutions? Let's dive in! πŸ‘‡\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## πŸ” Section 1: Code Review Automation Template\n", + "\n", + "### Building a Comprehensive Code Review Prompt with Multi-Tactic Combination\n", + "\n", + "
    \n", + "🎯 What You'll Build in This Section

    \n", + "\n", + "You'll create a **production-ready code review prompt template** that automatically analyzes code changes with the rigor of a senior engineer. This isn't just about finding bugs but rather you're building a system that provides consistent, traceable, and actionable feedback.\n", + "\n", + "**Time Required:** ~40 minutes (includes building, testing, and refining the template)\n", + "
    \n", + "\n", + "#### πŸ“‹ Before You Start: What You'll Need\n", + "\n", + "To get the most from this section, have ready:\n", + "\n", + "1. **A code diff to review** (options):\n", + " - A recent pull request from your repository\n", + " - Sample code provided in the activities below\n", + " - Any Python, JavaScript, or Java code change you want analyzed\n", + "\n", + "2. **Clear review criteria** for your domain:\n", + " - What counts as a \"blocker\" vs \"minor\" issue in your team?\n", + " - Which security patterns should be enforced?\n", + " - What performance thresholds matter for your application?\n", + "\n", + "3. **Your API connection** set up and tested (from the setup section above)\n", + "\n", + "
    \n", + "πŸ’‘ Why This Approach Works with Modern LLMs

    \n", + "\n", + "This template follows industry best practices for prompt engineering with advanced language models. According to [Claude 4 prompt engineering best practices](https://docs.claude.com/en/docs/build-with-claude/prompt-engineering/claude-4-best-practices), modern LLMs excel when you:\n", + "\n", + "- **Be explicit about expectations** - We'll define exactly what constitutes each severity level\n", + "- **Provide context for behavior** - Explain *why* certain patterns are problematic (e.g., \"SQL injection vulnerabilities allow attackers to access sensitive data\")\n", + "- **Use structured formats** - XML tags help models maintain focus across complex multi-step analyses\n", + "- **Encourage visible reasoning** - Chain-of-thought reveals the \"why\" behind each finding, making reviews auditable\n", + "\n", + "These aren't arbitrary choicesβ€”they directly address how advanced language models process instructions most effectively, ensuring consistent results across different AI providers.\n", + "
    \n", + "\n", + "#### 🎯 The Problem We're Solving\n", + "\n", + "Manual code reviews face three critical challenges:\n", + "\n", + "1. **⏰ Time Bottlenecks** \n", + " - Senior engineers spend 8-12 hours/week reviewing PRs\n", + " - Review queues delay feature delivery by 2-3 days on average\n", + " - **Impact:** Slower velocity, frustrated developers\n", + "\n", + "2. **🎯 Inconsistent Standards**\n", + " - Different reviewers prioritize different concerns\n", + " - New team members lack institutional knowledge\n", + " - Review quality varies based on reviewer fatigue\n", + " - **Impact:** Technical debt accumulates, security gaps emerge\n", + "\n", + "3. **πŸ“ Lost Knowledge**\n", + " - Review reasoning buried in PR comments\n", + " - No searchable audit trail for security decisions\n", + " - Hard to train junior developers on review standards\n", + " - **Impact:** Repeated mistakes, difficult compliance auditing\n", + "\n", + "#### ✨ Understanding Prompt Templates\n", + "\n", + "According to [prompt templating best practices](https://docs.claude.com/en/docs/build-with-claude/prompt-engineering/prompt-templates-and-variables), effective prompts separate **fixed content** (static instructions) from **variable content** (dynamic inputs). This separation enables:\n", + "\n", + "**Key Benefits:**\n", + "- **Consistency** - Same review standards applied every time\n", + "- **Efficiency** - Swap inputs without rewriting instructions\n", + "- **Testability** - Quickly test different code diffs\n", + "- **Scalability** - Manage complexity as your application grows\n", + "- **Version Control** - Track changes to prompt logic separately from data\n", + "\n", + "**How to Templatize:**\n", + "1. **Identify fixed content** - Instructions that never change (e.g., \"Act as a Senior Backend Engineer\")\n", + "2. **Identify variable content** - Dynamic data that changes per request (e.g., code diffs, repository names)\n", + "3. **Use placeholders** - Mark variables with `{{double_brackets}}` for easy identification\n", + "4. 
**Separate concerns** - Keep prompt logic in templates, data in variables\n", + "\n", + "**Example:**\n", + "```\n", + "Fixed: \"Review this code for security issues\"\n", + "Variable: {{code_diff}} ← Changes with each API call\n", + "Template: \"Review this code for security issues: {{code_diff}}\"\n", + "```\n", + "\n", + "#### πŸ—οΈ How We'll Build It: The Tactical Combination\n", + "\n", + "This template strategically combines five Module 2 tactics:\n", + "\n", + "| **Tactic** | **Purpose in This Template** | **Why Modern LLMs Need This** |\n", + "|------------|------------------------------|-------------------------------|\n", + "| **Role Prompting** | Establishes \"Senior Backend Engineer\" perspective with specific expertise | LLMs respond better when given explicit expertise context rather than assuming generic knowledge |\n", + "| **Structured Inputs (XML)** | Separates code, context, and guidelines into clear sections | Prevents models from mixing different information types during analysis |\n", + "| **Task Decomposition** | Breaks review into 4 sequential steps (Think β†’ Assess β†’ Suggest β†’ Verdict) | Advanced LLMs excel at following explicit numbered steps rather than implicit workflows |\n", + "| **Chain-of-Thought** | Makes reasoning visible in Analysis section | Improves accuracy by forcing deliberate analysis before conclusions |\n", + "| **Structured Output** | Uses readable markdown format with severity levels | Enables human readability while maintaining parseable structure for automation |\n", + "\n", + "
    \n", + "πŸš€ Let's Build It!

    \n", + "\n", + "In the next cell, you'll see the complete template structure. **Pay special attention to**:\n", + "- How we use explicit language to define severity levels (not \"bad code\" but \"allows SQL injection\")\n", + "- Why the markdown output format is more readable than XML while still being parseable\n", + "- How parameters like `{{tech_stack}}` and `{{change_purpose}}` make the template reusable across projects\n", + "- How the 6 review dimensions (Security, Performance, Error Handling, etc.) ensure comprehensive analysis\n", + "\n", + "After reviewing the template, you'll test it on real code and see how each tactic contributes to the result.\n", + "
    \n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### πŸ“‹ Template Structure\n", + "\n", + "```xml\n", + "\n", + "Act as a Senior Backend Engineer specializing in {{tech_stack}}.\n", + "\n", + "\n", + "\n", + "Repository: {{repo_name}}\n", + "Service: {{service_name}}\n", + "Purpose: {{change_purpose}}\n", + "\n", + "\n", + "\n", + "{{code_diff}}\n", + "\n", + "\n", + "\n", + "Evaluate the code across these critical dimensions:\n", + "\n", + "1. **Security**: Check for vulnerabilities (SQL injection, XSS, insecure dependencies, exposed secrets)\n", + "2. **Performance**: Identify bottlenecks (N+1 queries, memory leaks, inefficient algorithms)\n", + "3. **Error Handling**: Validate proper exception handling and edge case coverage\n", + "4. **Code Quality**: Assess readability, simplicity, and adherence to standards\n", + "5. **Correctness**: Verify logic achieves intended functionality\n", + "6. **Maintainability**: Check for unnecessary complexity or dependencies\n", + "\n", + "For each finding:\n", + "- Cite exact lines using git diff markers\n", + "- Explain why it's problematic (impact on users, security, or system)\n", + "- If code is acceptable, confirm with specific justification\n", + "\n", + "\n", + "\n", + "Step 1 - Think: Analyze the code systematically using chain-of-thought reasoning in the Analysis section.\n", + " Consider:\n", + " β€’ What could go wrong with this code?\n", + " β€’ Are there security implications?\n", + " β€’ How does this perform at scale?\n", + " β€’ Are edge cases handled?\n", + "\n", + "Step 2 - Assess: For each issue identified, provide:\n", + " β€’ Severity: \n", + " - BLOCKER: Security vulnerabilities, data loss risks, critical bugs\n", + " - MAJOR: Performance issues, poor error handling, significant technical debt\n", + " - MINOR: Code style inconsistencies, missing comments, small optimizations\n", + " - NIT: Formatting, naming conventions, trivial improvements\n", + " β€’ Description: What is the issue and why it matters\n", + " β€’ Evidence: Specific line numbers and code excerpts\n", + " β€’ Impact: Potential consequences (security risk, performance degradation, etc.)\n", + "\n", + "Step 3 - Suggest: Provide actionable remediation:\n", + " β€’ Specific code improvements or refactoring\n", + " β€’ Alternative approaches to consider\n", + " β€’ Questions for the author about design decisions\n", + "\n", + "Step 4 - Verdict: Conclude with clear decision:\n", + " β€’ Pass/Fail/Needs Discussion\n", + " β€’ Summary of key findings\n", + " β€’ Required actions before merge\n", + "\n", + "\n", + "\n", + "Provide your review in clear markdown format:\n", + "\n", + "## 🧠 Analysis\n", + "[Your reasoning about potential issues - what patterns concern you?]\n", + "\n", + "## πŸ” Findings\n", + "\n", + "### [SEVERITY] Issue Title\n", + "**Lines:** [specific line numbers]\n", + "**Problem:** [what's wrong and why it matters]\n", + "**Impact:** [consequences - security risk, performance, etc.]\n", + "**Fix:** [specific recommendation or code suggestion]\n", + "\n", + "[Repeat for each issue found]\n", + "\n", + "## βœ… Verdict\n", + "**Decision:** [PASS / FAIL / NEEDS_DISCUSSION]\n", + "**Summary:** [Brief overview of review]\n", + "**Required Actions:** [What must be done before merge]\n", + "\n", + "```" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### 🎯 What Makes This Production-Ready?\n", + "\n", + "βœ… **Comprehensive Review Dimensions** - Covers Security, Performance, Error Handling, Code 
Quality, Correctness, and Maintainability (not just \"find bugs\")\n", + "\n", + "βœ… **Clear Severity Definitions** - Explicit criteria for BLOCKER/MAJOR/MINOR/NIT classifications prevent ambiguity\n", + "\n", + "βœ… **Impact Analysis** - Every finding explains *why* it matters (security risk, performance degradation, maintainability issues)\n", + "\n", + "βœ… **Actionable Guidance** - Prompts for specific code improvements, not vague suggestions\n", + "\n", + "βœ… **Decision Framework** - Pass/Fail/Needs Discussion verdict with required actions before merge\n", + "\n", + "βœ… **Readable Output Format** - Uses clean markdown instead of verbose XML for better human readability and easier integration with PR tools\n", + "\n", + "These additions ensure reviews are consistent, auditable, and aligned with production quality standards.\n", + "\n", + "---\n", + "\n", + "### πŸ’» Working Example: Reviewing a Security Vulnerability\n", + "\n", + "Let's apply our enhanced template to a real-world scenario - a code change that introduces a SQL injection vulnerability.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Example: Security-Focused Code Review with Enhanced Template\n", + "code_diff = \"\"\"\n", + "+ def get_user_by_email(email):\n", + "+ query = f\"SELECT * FROM users WHERE email = '{email}'\"\n", + "+ cursor.execute(query)\n", + "+ return cursor.fetchone()\n", + "\"\"\"\n", + "\n", + "messages = [\n", + " {\n", + " \"role\": \"system\",\n", + " \"content\": \"You are a Senior Security Engineer specializing in application security and OWASP Top 10 vulnerabilities.\"\n", + " },\n", + " {\n", + " \"role\": \"user\",\n", + " \"content\": f\"\"\"\n", + "\n", + "Repository: user-service-api\n", + "Service: Authentication Service\n", + "Purpose: Add email-based user lookup for login feature\n", + "Security Context: This service handles sensitive user authentication data and is exposed to external API requests\n", + "\n", + "\n", + "\n", + "{code_diff}\n", + "\n", + "\n", + "\n", + "Evaluate the code with emphasis on security vulnerabilities, following [AWS security scanning best practices](https://github.com/aws-samples/anthropic-on-aws/blob/main/advanced-claude-code-patterns/commands/security-scan.md):\n", + "\n", + "**Primary Focus - Security:**\n", + "- OWASP Top 10 vulnerabilities (Injection, Authentication, XSS, etc.)\n", + "- Input validation and sanitization\n", + "- Authentication and authorization flaws\n", + "- Sensitive data exposure\n", + "- Known CVE/CWE patterns\n", + "\n", + "**Secondary Considerations:**\n", + "- Performance implications of security fixes\n", + "- Error handling (avoid information leakage)\n", + "- Code quality and maintainability\n", + "- Correctness of implementation\n", + "\n", + "For each security finding:\n", + "- Identify the vulnerability type and CWE/CVE reference if applicable\n", + "- Cite exact lines using git diff markers\n", + "- Explain the attack vector and potential impact\n", + "- Provide secure coding remediation with examples\n", + "\n", + "\n", + "\n", + "Step 1 - Security Analysis: Systematically analyze for vulnerabilities in the Analysis section.\n", + " Consider:\n", + " β€’ What attack vectors exist in this code?\n", + " β€’ Which OWASP Top 10 categories apply?\n", + " β€’ What is the blast radius if exploited?\n", + " β€’ Are there any CWE patterns present?\n", + "\n", + "Step 2 - Vulnerability Assessment: For each security issue, provide:\n", + " β€’ Severity 
(Security-focused): \n", + " - CRITICAL: Remote code execution, authentication bypass, SQL injection allowing data exfiltration\n", + " - HIGH: Privilege escalation, XSS, insecure deserialization, significant data exposure\n", + " - MEDIUM: Information disclosure, missing security headers, weak encryption\n", + " - LOW: Security misconfigurations with limited impact, verbose error messages\n", + " β€’ Vulnerability Type: (e.g., \"SQL Injection - CWE-89\")\n", + " β€’ OWASP Category: (e.g., \"A03:2021 - Injection\")\n", + " β€’ Evidence: Specific vulnerable code with line numbers\n", + " β€’ Attack Scenario: How an attacker could exploit this\n", + " β€’ Impact: Data breach potential, system compromise, compliance violations\n", + "\n", + "Step 3 - Security Remediation: Provide secure alternatives:\n", + " β€’ Specific secure code implementation\n", + " β€’ Reference to security libraries/frameworks (e.g., parameterized queries, ORM)\n", + " β€’ Defense-in-depth recommendations\n", + " β€’ Security testing suggestions\n", + "\n", + "Step 4 - Security Verdict: Conclude with risk assessment:\n", + " β€’ Decision: BLOCK / FIX_REQUIRED / NEEDS_SECURITY_REVIEW / APPROVE_WITH_CONDITIONS\n", + " β€’ Risk Summary: Overall security posture assessment\n", + " β€’ Required Actions: Security fixes that must be implemented before deployment\n", + "\n", + "\n", + "\n", + "Provide your security review in clear markdown format:\n", + "\n", + "## πŸ”’ Security Analysis\n", + "[Your reasoning about security vulnerabilities - what attack vectors exist?]\n", + "\n", + "## 🚨 Security Findings\n", + "\n", + "### [SEVERITY] Vulnerability Type - CWE-XXX\n", + "**Lines:** [specific line numbers]\n", + "**OWASP Category:** [e.g., A03:2021 - Injection]\n", + "**Vulnerability:** [description of the security flaw]\n", + "**Attack Scenario:** [how an attacker exploits this]\n", + "**Impact:** [data breach, system compromise, compliance violation]\n", + "**Secure Fix:** [specific code solution with security best practices]\n", + "\n", + "[Repeat for each vulnerability found]\n", + "\n", + "## βœ… Security Verdict\n", + "**Risk Level:** [CRITICAL / HIGH / MEDIUM / LOW]\n", + "**Decision:** [BLOCK / FIX_REQUIRED / NEEDS_SECURITY_REVIEW / APPROVE_WITH_CONDITIONS]\n", + "**Summary:** [Overall security assessment]\n", + "**Required Security Actions:** [Must-fix items before deployment]\n", + "\n", + "\"\"\"\n", + " }\n", + "]\n", + "\n", + "print(\"πŸ”’ SECURITY-FOCUSED CODE REVIEW IN PROGRESS...\")\n", + "print(\"=\"*70)\n", + "review_result = get_chat_completion(messages, temperature=0.0)\n", + "print(review_result)\n", + "print(\"=\"*70)\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### πŸ‹οΈ Activity: Build Your Own Code Review Template\n", + "\n", + "
    \n", + "⏱️ Time Required: 35-50 minutes
    \n", + "This is a hands-on research and build activity. You'll explore professional code review patterns and create your own template.\n", + "
    \n", + "\n", + "#### πŸ“– What You'll Do\n", + "\n", + "This activity challenges you to **research, design, and build** a production-ready code review template by studying real-world patterns from AWS.\n", + "\n", + "#### πŸ“‹ Instructions\n", + "\n", + "Follow the **3-step process** in the code cell below:\n", + "\n", + "1. **RESEARCH (10-15 min)** - Study the AWS code review pattern and identify key elements\n", + "2. **DESIGN (10-15 min)** - Answer design questions to plan your template structure \n", + "3. **BUILD (15-20 min)** - Implement your template by adapting the starter code\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "---\n", + "\n", + "
    \n", + "\n", + "### πŸ“‹ STEP 1 - RESEARCH (10-15 minutes)\n", + "\n", + "**πŸ“– READ THE AWS CODE REVIEW PATTERN:**\n", + " \n", + "πŸ‘‰ [AWS Anthropic Code Review Pattern](https://github.com/aws-samples/anthropic-on-aws/blob/main/advanced-claude-code-patterns/commands/code-review.md)\n", + "\n", + "**πŸ” KEY THINGS TO LOOK FOR:**\n", + "- βœ“ How do they structure code review prompts?\n", + "- βœ“ What review dimensions do they cover? (Security, Performance, Quality, etc.)\n", + "- βœ“ What severity levels do they use and how are they defined?\n", + "- βœ“ What output format do they recommend?\n", + "- βœ“ How do they ensure actionable feedback?\n", + "\n", + "
    \n", + "\n", + "
    \n", + "\n", + "### πŸ’­ STEP 2 - DESIGN YOUR TEMPLATE (10-15 minutes)\n", + "\n", + "**ANSWER THESE QUESTIONS BEFORE CODING:**\n", + "\n", + "**1️⃣ ROLE:** What expertise should the AI have?\n", + " - πŸ’‘ *Hint: This is a Python authentication function - what type of engineer should review it?*\n", + "\n", + "**2️⃣ CONTEXT:** What information helps the AI understand the code?\n", + " - Repository and service name?\n", + " - Purpose of the code change?\n", + " - Technology stack specifics?\n", + " - Security requirements?\n", + "\n", + "**3️⃣ REVIEW DIMENSIONS:** What aspects should be evaluated?\n", + " \n", + " Consider the 6 dimensions from earlier in this notebook:\n", + " - **Security** (SQL injection, password handling, input validation)\n", + " - **Performance** (database queries, caching)\n", + " - **Error Handling** (exceptions, edge cases)\n", + " - **Code Quality** (readability, maintainability)\n", + " - **Correctness** (authentication logic)\n", + " - **Best Practices** (Python idioms, security standards)\n", + "\n", + "**4️⃣ OUTPUT FORMAT:** How should findings be presented?\n", + " - Markdown vs XML?\n", + " - What sections are needed?\n", + " - How to structure individual findings?\n", + " - What makes feedback actionable?\n", + "\n", + "
    \n", + "\n", + "
    \n", + "\n", + "### πŸ”¨ STEP 3 - BUILD YOUR TEMPLATE (15-20 minutes)\n", + "\n", + "**YOUR TASK:**\n", + "\n", + "⚠️ **Edit the starter template in the code cell below** by replacing all *TODO* sections with your own design based on your research in Steps 1 & 2.\n", + "\n", + "The starter template provides the basic structure - you need to enhance it by:\n", + "1. Improving the role definition\n", + "2. Adding relevant context\n", + "3. Expanding review guidelines with specific checks\n", + "4. Structuring tasks with clear steps\n", + "5. Designing an effective output format\n", + "\n", + "**πŸ’‘ TIP:** Look at the complete examples in Cells 9 and 11 to see how all pieces fit together!\n", + "\n", + "
    \n", + "\n", + "---\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# ╔══════════════════════════════════════════════════════════════════════════════╗\n", + "# β•‘ PRACTICE ACTIVITY CODE - Follow Steps 1-3 in the markdown cell above β•‘\n", + "# β•šβ•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•\n", + "\n", + "# ╔══════════════════════════════════════════════════════════════════════════════╗\n", + "# β•‘ CODE TO REVIEW: Python authentication function with multiple security issues β•‘\n", + "# β•šβ•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•\n", + "\n", + "practice_code = \"\"\"\n", + "+ import hashlib\n", + "+ \n", + "+ def authenticate_user(username, password):\n", + "+ # Connect to database\n", + "+ query = \"SELECT * FROM users WHERE username = '\" + username + \"'\"\n", + "+ user = db.execute(query)\n", + "+ \n", + "+ # Hash the password\n", + "+ hashed = hashlib.md5(password.encode()).hexdigest()\n", + "+ \n", + "+ # Check password\n", + "+ if user['password'] == hashed:\n", + "+ return user\n", + "+ return None\n", + "\"\"\"\n", + "\n", + "#═══════════════════════════════════════════════════════════════════════════════\n", + "# ⚠️ STARTER TEMPLATE - EDIT ALL TODO SECTIONS BELOW ⚠️\n", + "#═══════════════════════════════════════════════════════════════════════════════\n", + "# This is a basic template to get you started. Your task is to enhance it by:\n", + "# 1. Improving the role definition\n", + "# 2. Adding relevant context\n", + "# 3. Expanding review guidelines with specific checks\n", + "# 4. Structuring tasks with clear steps\n", + "# 5. Designing an effective output format\n", + "#═══════════════════════════════════════════════════════════════════════════════\n", + "\n", + "practice_messages = [\n", + " {\n", + " \"role\": \"system\",\n", + " # ⚠️ TODO: Change this role based on what you learned from AWS patterns\n", + " # πŸ’‘ Hint: What expertise is needed to review authentication code?\n", + " # Consider: \"Security Engineer\"? \"Senior Backend Engineer\"?\n", + " \"content\": \"You are a Senior Software Engineer.\"\n", + " },\n", + " {\n", + " \"role\": \"user\", \n", + " \"content\": f\"\"\"\n", + "\n", + "Repository: user-authentication-service\n", + "Service: Authentication API\n", + "Purpose: Add user login authentication endpoint\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "{practice_code}\n", + "\n", + "\n", + "\n", + "Evaluate the code across these critical dimensions:\n", + "\n", + "1. **Security**: Check for vulnerabilities (SQL injection, weak hashing, input validation)\n", + "2. **Performance**: Identify scalability issues (database queries, caching opportunities)\n", + "3. **Error Handling**: Validate exception handling (try-catch, edge cases)\n", + "4. **Code Quality**: Assess readability and maintainability\n", + "5. **Correctness**: Verify authentication logic works as intended\n", + "6. 
**Best Practices**: Check Python and security standards (OWASP guidelines)\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "Step 1 - Analyze: Systematically examine code for issues across all dimensions\n", + " [⚠️ Add specific analysis questions here - what should the LLM consider?]\n", + "\n", + "Step 2 - Assess: For each issue found, provide:\n", + " [⚠️ Define severity levels with concrete criteria]\n", + " [⚠️ Specify what evidence is needed]\n", + " [⚠️ Explain how to describe impact]\n", + "\n", + "Step 3 - Recommend: Provide actionable fixes\n", + " [⚠️ Define what makes recommendations actionable]\n", + "\n", + "Step 4 - Verdict: Conclude with clear decision\n", + " [⚠️ Specify decision format and required summary]\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "Provide your review in clear format:\n", + "\n", + "## [⚠️ Your Analysis Section Name]\n", + "[⚠️ What goes here? Think about chain-of-thought reasoning]\n", + "\n", + "## [⚠️ Your Findings Section Name]\n", + "[⚠️ How are issues structured? What information is essential?]\n", + "\n", + "### [SEVERITY] Issue Title\n", + "**Lines:** [⚠️ Specify what line information is needed]\n", + "**Problem:** [⚠️ Define how to explain the issue]\n", + "**Impact:** [⚠️ Define how to explain consequences]\n", + "**Fix:** [⚠️ Define how to provide recommendations]\n", + "\n", + "## [⚠️ Your Verdict Section Name]\n", + "[⚠️ What final information helps decision-making?]\n", + "\n", + "\"\"\"\n", + " }\n", + "]\n", + "\n", + "#═══════════════════════════════════════════════════════════════════════════════\n", + "# πŸ§ͺ TEST YOUR TEMPLATE - Uncomment when you've completed all TODO sections\n", + "#═══════════════════════════════════════════════════════════════════════════════\n", + "\n", + "# print(\"πŸ” TESTING YOUR CODE REVIEW TEMPLATE\")\n", + "# print(\"=\"*70)\n", + "# result = get_chat_completion(practice_messages, temperature=0.0)\n", + "# print(result)\n", + "# print(\"=\"*70)\n", + "\n", + "print(\"\"\"\n", + "╔══════════════════════════════════════════════════════════════════════════════╗\n", + "β•‘ πŸ’‘ HINTS FOR SUCCESS β•‘\n", + "β•šβ•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•\n", + "\n", + "πŸ“‹ CODE REVIEW ELEMENTS TO INCLUDE:\n", + " βœ“ Clear severity definitions: BLOCKER (security vulnerabilities), MAJOR, MINOR, NIT\n", + " βœ“ Evidence citations: Line numbers and specific code excerpts\n", + " βœ“ Impact explanation: Why the issue matters (security breach, data loss, etc.)\n", + " βœ“ Actionable recommendations: Specific code fixes with secure alternatives\n", + " βœ“ Reasoning transparency: Include analysis section showing thought process\n", + "\n", + "⚠️ CRITICAL ISSUES IN THIS AUTHENTICATION CODE:\n", + " β€’ SQL Injection vulnerability (line 5 - string concatenation in query)\n", + " β€’ Weak password hashing (line 8 - MD5 is cryptographically broken)\n", + " β€’ Missing error handling (no try-catch, no validation)\n", + " β€’ Information leakage (no distinction between \"user not found\" vs \"wrong password\")\n", + " β€’ No input validation (username/password could be empty, malicious)\n", + " β€’ Missing security best practices (no rate limiting, no password complexity)\n", + "\n", + "❓ SELF-CHECK QUESTIONS:\n", + " β†’ Does my prompt cover all 6 review dimensions?\n", + " β†’ Does it prioritize security issues 
appropriately for authentication code?\n", + " β†’ Does it request specific evidence (line numbers, vulnerable code excerpts)?\n", + " β†’ Does it ask for secure code examples (parameterized queries, bcrypt)?\n", + " β†’ Are severity levels well-defined with concrete security impact?\n", + " β†’ Does it use a clear, readable output format (markdown recommended)?\n", + "\n", + "╔══════════════════════════════════════════════════════════════════════════════╗\n", + "β•‘ 🎯 NEXT CHALLENGES β•‘\n", + "β•šβ•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•\n", + "\n", + "After creating your template, extend your learning:\n", + " \n", + " 1️⃣ Test it on different code samples (frontend, backend, different languages)\n", + " 2️⃣ Create specialized variants:\n", + " β€’ Security-only (reference: AWS security-scan.md pattern)\n", + " β€’ Performance-only (reference: AWS analyze-performance.md pattern)\n", + " 3️⃣ Compare with the complete examples in Cells 9 and 11\n", + " β€’ What did you do similarly? Differently?\n", + " β€’ Which approach works better for your use case?\n", + "\n", + "πŸ“š REFERENCE CELLS:\n", + " β€’ Cell 9: General code review template with markdown output\n", + " β€’ Cell 11: Security-focused review example with OWASP categories\n", + " β€’ Cell 14: The 3-step process for this practice activity\n", + "\"\"\")\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "---\n", + "\n", + "### πŸ“š Learn More: Advanced Code Review Patterns\n", + "\n", + "Want to dive deeper into production code review automation? Explore these resources:\n", + "\n", + "**πŸ“– AWS Anthropic Advanced Patterns**\n", + "- [Code Review Command Pattern](https://github.com/aws-samples/anthropic-on-aws/blob/main/advanced-claude-code-patterns/commands/code-review.md) - Production-ready patterns for AI-powered code review\n", + "- Covers advanced topics like multi-file reviews, security-focused analysis, and CI/CD integration\n", + "\n", + "**πŸ”— Related Best Practices**\n", + "- [Claude 4 Prompt Engineering Best Practices](https://docs.claude.com/en/docs/build-with-claude/prompt-engineering/claude-4-best-practices) - Core prompting techniques\n", + "- [Prompt Templates and Variables](https://docs.claude.com/en/docs/build-with-claude/prompt-engineering/prompt-templates-and-variables) - Parameterization strategies\n", + "\n", + "**πŸ’‘ What You Can Build Next:**\n", + "- Integrate this template into your CI/CD pipeline (GitHub Actions, GitLab CI)\n", + "- Create specialized variants (security-only reviews, performance-only reviews)\n", + "- Build a review bot that automatically comments on pull requests\n", + "- Develop custom severity criteria tailored to your team's standards\n", + "\n", + "
    \n", + "🎯 Pro Tip: The AWS patterns repository includes examples of integrating these templates with AWS Lambda, CodeCommit, and other cloud services. Great for enterprise deployments!\n", + "
    " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "---\n", + "\n", + "
    \n", + "
    \n", + "

    β˜• Suggested Break Point #1

    \n", + "

    ~40 minutes elapsed

    \n", + "
    \n", + " \n", + "
    \n", + "

    βœ… Completed:

    \n", + "
      \n", + "
    • Section 1: Code Review Automation Template
    • \n", + "
    • Built production-ready code review prompts
    • \n", + "
    • Practiced with security vulnerability detection
    • \n", + "
    • Reviewed React component for performance issues
    • \n", + "
    \n", + "
    \n", + " \n", + "
    \n", + "

    ⏭️ Coming Next:

    \n", + "
      \n", + "
    • Section 2: Test Case Generation Template
    • \n", + "
    • Coverage gap identification
    • \n", + "
    • Smart test plan creation
    • \n", + "
    \n", + "

    ⏱️ Next section: ~30-35 minutes

    \n", + "
    \n", + " \n", + "
    \n", + "

    πŸ“Œ BOOKMARK TO RESUME:

    \n", + "

    \"Section 2: Test Case Generation Template\"

    \n", + "
    \n", + " \n", + "

    \n", + " πŸ’‘ This is a natural stopping point. Feel free to take a break and return later!\n", + "

    \n", + "
    \n", + "\n", + "---\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## πŸ§ͺ Section 2: Test Generation Automation Template\n", + "\n", + "### Building a Comprehensive Test Generation Prompt with Multi-Tactic Combination\n", + "\n", + "
    \n", + "🎯 What You'll Build in This Section

    \n", + "\n", + "You'll create a **production-ready test generation prompt template** that automatically produces comprehensive test suites by analyzing requirements and identifying coverage gaps. This isn't just about writing happy-path testsβ€”you're building a system that uncovers edge cases, flags ambiguities, and produces actionable test specifications.\n", + "\n", + "**Time Required:** ~40 minutes (includes building, testing, and refining the template)\n", + "
    \n", + "\n", + "#### πŸ“‹ Before You Start: What You'll Need\n", + "\n", + "To get the most from this section, have ready:\n", + "\n", + "1. **Requirements to test** (options):\n", + " - A feature spec from your current sprint\n", + " - User stories with acceptance criteria\n", + " - Sample requirements provided in the activities below\n", + " - Any vague or ambiguous requirements that need clarification\n", + "\n", + "2. **Context about your test strategy**:\n", + " - What test types does your team write? (unit, integration, E2E)\n", + " - What test framework do you use? (pytest, Jest, JUnit)\n", + " - What makes a good test specification in your workflow?\n", + "\n", + "3. **Your API connection** set up and tested (from the setup section above)\n", + "\n", + "
    \n", + "πŸ’‘ Why This Approach Works with Modern LLMs

    \n", + "\n", + "This template follows industry best practices for prompt engineering with advanced language models. According to [Claude 4 prompt engineering best practices](https://docs.claude.com/en/docs/build-with-claude/prompt-engineering/claude-4-best-practices), modern LLMs excel when you:\n", + "\n", + "- **Structure the analysis process** - We'll decompose test generation into clear steps: analyze requirements β†’ identify gaps β†’ generate specs β†’ document infrastructure needs\n", + "- **Request explicit reasoning** - Chain-of-thought helps the model explain *why* certain edge cases matter (e.g., \"Testing expiration at midnight requires timezone handling\")\n", + "- **Use systematic frameworks** - Categorizing tests by type (unit/integration) and coverage dimension (happy path/edge case/error path) produces more thorough results\n", + "- **Flag ambiguities proactively** - Encouraging the model to question unclear requirements prevents wasted testing effort on wrong assumptions\n", + "\n", + "These aren't arbitrary choicesβ€”they directly address how advanced language models process instructions most effectively, ensuring comprehensive test coverage across different AI providers.\n", + "
    \n", + "\n", + "#### 🎯 The Problem We're Solving\n", + "\n", + "Manual test planning faces three critical challenges:\n", + "\n", + "1. **πŸ“‹ Incomplete Coverage**\n", + " - Easy to miss edge cases and error paths\n", + " - Boundary conditions often overlooked (0%, 100%, empty inputs)\n", + " - Security and performance test scenarios forgotten\n", + " - **Impact:** Bugs slip through to production, customer trust erodes\n", + "\n", + "2. **⏰ Time Pressure**\n", + " - Testing gets squeezed at the end of sprints\n", + " - QA teams struggle to keep up with feature velocity\n", + " - Test planning rushed, documentation minimal\n", + " - **Impact:** Technical debt in test suites, maintenance nightmares\n", + "\n", + "3. **🎲 Missed Ambiguities**\n", + " - Unclear requirements don't get questioned until implementation\n", + " - Assumptions made without validation\n", + " - Integration points and dependencies discovered late\n", + " - **Impact:** Rework, missed deadlines, scope creep\n", + "\n", + "#### πŸ—οΈ How We'll Build It: The Tactical Combination\n", + "\n", + "| Tactic | Purpose | Implementation |\n", + "|--------|---------|----------------|\n", + "| **Role Prompting** | Assign QA expertise | \"You are a QA Automation Lead with expertise in {{tech_stack}}\" |\n", + "| **Structured Inputs** | Organize requirements & existing tests | XML tags: ``, `` |\n", + "| **Task Decomposition** | Break down test generation process | Numbered steps: Analyze β†’ Identify Gaps β†’ Generate Tests β†’ Document Dependencies |\n", + "| **Chain-of-Thought** | Encourage reasoning about coverage | Request explicit analysis of gaps and ambiguities |\n", + "| **Structured Output** | Enable automation | Markdown format with sections for different test types |\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### πŸ“‹ Test Generation Template Structure\n", + "\n", + "
    \n", + "πŸ”¨ Let's Build It

    \n", + "\n", + "We'll construct this template by:\n", + "1. **Defining the QA role** with specific tech stack expertise\n", + "2. **Structuring inputs** using XML tags for requirements and existing test context\n", + "3. **Decomposing the task** into: Analyze β†’ Identify Gaps β†’ Generate Tests β†’ Document Dependencies\n", + "4. **Requesting chain-of-thought** for coverage analysis\n", + "5. **Specifying markdown output** for test plans (replacing verbose XML with readable format)\n", + "6. **Adding parameters** (`{{tech_stack}}`, `{{requirements}}`, `{{existing_tests}}`) for reusability\n", + "\n", + "This template draws inspiration from [AWS Anthropic test generation patterns](https://github.com/aws-samples/anthropic-on-aws/blob/main/advanced-claude-code-patterns/commands/generate-tests.md), adapted for clarity and automation.\n", + "
    \n", + "\n", + "```xml\n", + "\n", + "You are a QA Automation Lead with expertise in {{tech_stack}}.\n", + "\n", + "\n", + "\n", + "{{functional_requirements}}\n", + "\n", + "\n", + "\n", + "{{test_suite_overview}}\n", + "\n", + "\n", + "\n", + "1. Analyze the requirements and existing test coverage\n", + "2. Identify coverage gaps across these dimensions:\n", + " - Missing scenarios (happy paths, edge cases, error paths)\n", + " - Business rule validation\n", + " - Data boundary conditions\n", + " - Concurrent/async behavior\n", + " - Security concerns (auth, input validation)\n", + " - Performance considerations\n", + "\n", + "3. For each identified gap, generate test specifications including:\n", + " - Test name (descriptive, follows naming conventions)\n", + " - Purpose (what does this test verify?)\n", + " - Test type (unit, integration, e2e)\n", + " - Preconditions (required setup, test data, mocks)\n", + " - Steps (execution sequence)\n", + " - Expected outcome (assertions, success criteria)\n", + "\n", + "4. Categorize tests by type and document dependencies\n", + "\n", + "5. Flag ambiguities in requirements that need clarification\n", + "\n", + "\n", + "\n", + "Provide your test plan in clear markdown format:\n", + "\n", + "## πŸ” Analysis\n", + "[Your reasoning about requirements and existing coverage - what patterns do you see?]\n", + "\n", + "## ⚠️ Ambiguities\n", + "[Requirements that need clarification before testing]\n", + "\n", + "## πŸ“Š Coverage Gaps\n", + "[What's missing from current test suite?]\n", + "\n", + "## πŸ§ͺ Unit Tests\n", + "\n", + "### Test: [Descriptive Name]\n", + "**Purpose:** [What this test verifies]\n", + "**Preconditions:** [Required setup]\n", + "**Steps:**\n", + "1. [Action]\n", + "2. [Action]\n", + "**Expected:** [Success criteria]\n", + "\n", + "[Repeat for each unit test]\n", + "\n", + "## πŸ”— Integration Tests\n", + "\n", + "### Test: [Descriptive Name]\n", + "**Purpose:** [What this test verifies]\n", + "**Preconditions:** [Required setup]\n", + "**Steps:**\n", + "1. [Action]\n", + "2. [Action]\n", + "**Expected:** [Success criteria]\n", + "\n", + "[Repeat for each integration test]\n", + "\n", + "## πŸ› οΈ Test Infrastructure Needs\n", + "[Mocks, fixtures, test data, environment dependencies]\n", + "\n", + "```\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### πŸ’» Working Example: Payment Service Test Generation\n", + "\n", + "Let's generate comprehensive tests for a payment processing service.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Example: Test Case Generation for Payment Service\n", + "\n", + "functional_requirements = \"\"\"\n", + "Payment Processing Requirements:\n", + "1. Process credit card payments with validation\n", + "2. Handle multiple currencies (USD, EUR, GBP)\n", + "3. Apply discounts and calculate tax\n", + "4. Generate transaction receipts\n", + "5. Handle payment failures and retries (max 3 attempts)\n", + "6. Send confirmation emails on success\n", + "7. Log all transactions for audit compliance\n", + "8. 
Support payment refunds within 30 days\n", + "\"\"\"\n", + "\n", + "existing_tests = \"\"\"\n", + "Current Test Suite (payment_service_test.py):\n", + "- test_process_valid_payment() - Happy path for USD payments\n", + "- test_invalid_card_number() - Validates card number format\n", + "- test_calculate_tax() - Tax calculation for US region only\n", + "\"\"\"\n", + "\n", + "test_messages = [\n", + " {\n", + " \"role\": \"system\",\n", + " \"content\": \"You are a QA Automation Lead with expertise in Python testing frameworks (pytest).\"\n", + " },\n", + " {\n", + " \"role\": \"user\",\n", + " \"content\": f\"\"\"\n", + "\n", + "{functional_requirements}\n", + "\n", + "\n", + "\n", + "{existing_tests}\n", + "\n", + "\n", + "\n", + "1. Analyze the requirements and existing test coverage\n", + "2. Identify coverage gaps across these dimensions:\n", + " - Missing scenarios (happy paths, edge cases, error paths)\n", + " - Business rule validation\n", + " - Data boundary conditions\n", + " - Concurrent/async behavior\n", + " - Security concerns (auth, input validation)\n", + " - Performance considerations\n", + "\n", + "3. For each identified gap, generate test specifications including:\n", + " - Test name (descriptive, follows naming conventions)\n", + " - Purpose (what does this test verify?)\n", + " - Test type (unit, integration, e2e)\n", + " - Preconditions (required setup, test data, mocks)\n", + " - Steps (execution sequence)\n", + " - Expected outcome (assertions, success criteria)\n", + "\n", + "4. Categorize tests by type and document dependencies\n", + "\n", + "5. Flag ambiguities in requirements that need clarification\n", + "\n", + "\n", + "\n", + "Provide your test plan in clear markdown format:\n", + "\n", + "## πŸ” Analysis\n", + "[Your reasoning about requirements and existing coverage - what patterns do you see?]\n", + "\n", + "## ⚠️ Ambiguities\n", + "[Requirements that need clarification before testing]\n", + "\n", + "## πŸ“Š Coverage Gaps\n", + "[What's missing from current test suite?]\n", + "\n", + "## πŸ§ͺ Unit Tests\n", + "\n", + "### Test: [Descriptive Name]\n", + "**Purpose:** [What this test verifies]\n", + "**Preconditions:** [Required setup]\n", + "**Steps:**\n", + "1. [Action]\n", + "2. [Action]\n", + "**Expected:** [Success criteria]\n", + "\n", + "[Repeat for each unit test]\n", + "\n", + "## πŸ”— Integration Tests\n", + "\n", + "### Test: [Descriptive Name]\n", + "**Purpose:** [What this test verifies]\n", + "**Preconditions:** [Required setup]\n", + "**Steps:**\n", + "1. [Action]\n", + "2. [Action]\n", + "**Expected:** [Success criteria]\n", + "\n", + "[Repeat for each integration test]\n", + "\n", + "## πŸ› οΈ Test Infrastructure Needs\n", + "[Mocks, fixtures, test data, environment dependencies]\n", + "\n", + "\"\"\"\n", + " }\n", + "]\n", + "\n", + "print(\"πŸ§ͺ TEST GENERATION IN PROGRESS...\")\n", + "print(\"=\"*70)\n", + "test_result = get_chat_completion(test_messages, temperature=0.0)\n", + "print(test_result)\n", + "print(\"=\"*70)\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### πŸ‹οΈ Practice Activity: Build Your Own Test Generation Template\n", + "\n", + "
    \n", + "⏱️ Time Required: 35-50 minutes
    \n", + "This is a hands-on research and build activity. You'll explore professional test generation patterns and create your own template.\n", + "
    \n", + "\n", + "#### πŸ“– What You'll Do\n", + "\n", + "This activity challenges you to **research, design, and build** a production-ready test generation template by studying real-world patterns from AWS. You'll work with ambiguous requirements for a shopping cart discount system - a perfect scenario for showcasing comprehensive test planning.\n", + "\n", + "#### 🎯 Learning Objectives\n", + "\n", + "By completing this activity, you will:\n", + "- βœ… Learn how to research and adapt professional test generation patterns\n", + "- βœ… Understand how to identify coverage gaps and ambiguities in requirements\n", + "- βœ… Practice designing structured test plans with unit and integration tests\n", + "- βœ… Build a reusable template for automated test case generation\n", + "\n", + "#### πŸ“‹ The Scenario\n", + "\n", + "A product manager has provided vague requirements for a new feature:\n", + "\n", + "**Feature: Shopping Cart Discount System**\n", + "- Users can apply discount codes at checkout\n", + "- Some discounts are percentage-based, others are fixed amounts\n", + "- Discounts have expiration dates\n", + "- Some codes are one-time use, others unlimited\n", + "- Discounts can't be combined\n", + "\n", + "**Existing Test Coverage:**\n", + "```python\n", + "# Current test suite:\n", + "- test_apply_percentage_discount() # 10% off $100 cart\n", + "- test_fixed_amount_discount() # $5 off $50 cart\n", + "```\n", + "\n", + "**Your Challenge:** These requirements are intentionally vague! Your template should identify ambiguities, generate edge cases, and produce comprehensive test specifications.\n", + "\n", + "#### πŸ” Code Sample for Testing\n", + "\n", + "Below you'll find the discount system requirements with minimal existing coverage. Use this as your test case while building your template.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "---\n", + "\n", + "
    \n", + "\n", + "### πŸ“‹ STEP 1 - RESEARCH (10-15 minutes)\n", + "\n", + "**πŸ“– READ THE AWS TEST GENERATION PATTERN:**\n", + " \n", + "πŸ‘‰ [AWS Anthropic Test Generation Pattern](https://github.com/aws-samples/anthropic-on-aws/blob/main/advanced-claude-code-patterns/commands/generate-tests.md)\n", + "\n", + "**πŸ” KEY THINGS TO LOOK FOR:**\n", + "- βœ“ How do they structure test generation prompts?\n", + "- βœ“ What dimensions do they analyze? (happy paths, edge cases, error paths)\n", + "- βœ“ How do they handle ambiguous requirements?\n", + "- βœ“ What output format do they recommend for test specifications?\n", + "- βœ“ How do they categorize tests (unit vs integration)?\n", + "\n", + "
    \n", + "\n", + "
    \n", + "\n", + "### πŸ’­ STEP 2 - DESIGN YOUR TEMPLATE (10-15 minutes)\n", + "\n", + "**ANSWER THESE QUESTIONS BEFORE CODING:**\n", + "\n", + "**1️⃣ ROLE:** What expertise should the AI have?\n", + " - πŸ’‘ *Hint: This is an e-commerce discount system - what type of QA engineer should test it?*\n", + "\n", + "**2️⃣ INPUTS:** What information helps the AI generate comprehensive tests?\n", + " - Requirements document (the vague feature description)?\n", + " - Existing test coverage (what's already tested)?\n", + " - Tech stack context (Python/pytest, JavaScript/Jest)?\n", + " - Business rules to validate?\n", + "\n", + "**3️⃣ COVERAGE DIMENSIONS:** What aspects should be tested?\n", + " \n", + " Consider these test categories:\n", + " - **Happy Paths** (valid discount codes, successful applications)\n", + " - **Edge Cases** (boundary values: 0%, 100% discounts, $0.01 amounts)\n", + " - **Error Paths** (expired codes, invalid codes, already-used one-time codes)\n", + " - **Business Rules** (no combination, minimum cart value requirements)\n", + " - **Ambiguities** (What if discount > cart total? Case sensitivity?)\n", + "\n", + "**4️⃣ OUTPUT FORMAT:** How should test specifications be structured?\n", + " - Markdown vs XML?\n", + " - What fields per test? (name, purpose, preconditions, steps, expected)\n", + " - How to separate unit vs integration tests?\n", + " - How to flag ambiguities and infrastructure needs?\n", + "\n", + "
    \n", + "\n", + "
    \n", + "\n", + "### πŸ”¨ STEP 3 - BUILD YOUR TEMPLATE (15-20 minutes)\n", + "\n", + "**YOUR TASK:**\n", + "\n", + "⚠️ **Edit the starter template in the code cell below** by replacing all `TODO` sections with your own design based on your research in Steps 1 & 2.\n", + "\n", + "The starter template provides the basic structure - you need to enhance it by:\n", + "1. Improving the QA role definition\n", + "2. Adding complete requirements and existing test context\n", + "3. Expanding task steps for comprehensive coverage analysis\n", + "4. Designing an effective output format for test specifications\n", + "\n", + "**πŸ’‘ TIP:** Look at the complete example in Cell 20 to see how all pieces fit together!\n", + "\n", + "
    \n", + "\n", + "---\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "---\n", + "\n", + "**πŸ“– Full Solution Reference:** \n", + "\n", + "After completing your template, you can compare your approach with [solutions/activity-3.3-test-generation-solution.md](solutions/activity-3.3-test-generation-solution.md) to see:\n", + "- A complete test generation template implementation\n", + "- How to identify ambiguities in requirements systematically\n", + "- Examples of comprehensive edge case coverage\n", + "- Sprint planning and TDD workflow integration\n", + "\n", + "
    \n", + "πŸ’‘ Remember: There's no single \"correct\" solution. The goal is to build a template that works for your specific testing needs. Focus on understanding the why behind each design decision rather than matching the solution exactly.\n", + "
    \n", + "\n", + "---\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# ╔══════════════════════════════════════════════════════════════════════════════╗\n", + "# β•‘ PRACTICE ACTIVITY CODE - Follow Steps 1-3 in the markdown cell above β•‘\n", + "# β•šβ•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•\n", + "\n", + "#═══════════════════════════════════════════════════════════════════════════════\n", + "# REQUIREMENTS: Shopping Cart Discount System (intentionally vague!)\n", + "#═══════════════════════════════════════════════════════════════════════════════\n", + "\n", + "discount_requirements = \"\"\"\n", + "Feature: Shopping Cart Discount System\n", + "\n", + "Requirements:\n", + "1. Users can apply discount codes at checkout\n", + "2. Discount types: percentage (10%, 25%, etc.) or fixed amount ($5, $20, etc.)\n", + "3. Each discount code has an expiration date\n", + "4. Usage limits: one-time use OR unlimited\n", + "5. Business rule: Discounts cannot be combined (one per order)\n", + "6. Cart total must be > 0 after discount applied\n", + "7. Fixed discounts cannot exceed cart total\n", + "\"\"\"\n", + "\n", + "existing_discount_tests = \"\"\"\n", + "Current test suite (minimal coverage):\n", + "- test_apply_percentage_discount() - 10% off $100 cart\n", + "- test_fixed_amount_discount() - $5 off $50 cart\n", + "\"\"\"\n", + "\n", + "#═══════════════════════════════════════════════════════════════════════════════\n", + "# ⚠️ STARTER TEMPLATE - EDIT ALL TODO SECTIONS BELOW ⚠️\n", + "#═══════════════════════════════════════════════════════════════════════════════\n", + "# This template provides basic structure. Your task is to enhance it by:\n", + "# 1. Refining the QA role definition\n", + "# 2. Structuring comprehensive task steps\n", + "# 3. Designing an effective output format\n", + "# 4. Adding coverage dimensions and ambiguity detection\n", + "#═══════════════════════════════════════════════════════════════════════════════\n", + "\n", + "discount_test_messages = [\n", + " {\n", + " \"role\": \"system\",\n", + " # ⚠️ TODO: Refine this role based on what you learned from AWS patterns\n", + " # πŸ’‘ Hint: What specific QA expertise is needed for e-commerce testing?\n", + " # Consider: \"QA Automation Lead specializing in...\"?\n", + " \"content\": \"You are a QA Automation Lead specializing in e-commerce testing.\"\n", + " },\n", + " {\n", + " \"role\": \"user\",\n", + " \"content\": f\"\"\"\n", + "\n", + "{discount_requirements}\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "{existing_discount_tests}\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "1. Analyze requirements and identify ambiguities or missing specifications\n", + " [⚠️ Add guiding questions: What constitutes an ambiguity? How to flag them?]\n", + "\n", + "2. List coverage gaps in existing tests\n", + " [⚠️ Define dimensions: happy paths, edge cases, error paths, business rules]\n", + "\n", + "3. Generate comprehensive test cases\n", + " [⚠️ Specify what each test needs: name format, purpose, preconditions, steps, expected]\n", + "\n", + "4. Categorize tests by type\n", + " [⚠️ Define criteria: What makes a test \"unit\" vs \"integration\"?]\n", + "\n", + "5. 
Document test infrastructure needs\n", + " [⚠️ What should be flagged: mocks, fixtures, environment dependencies?]\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "Provide your test plan in clear format:\n", + "\n", + "## [⚠️ Your Analysis Section Name]\n", + "[⚠️ What goes here? Think about requirement analysis and ambiguity detection]\n", + "\n", + "## [⚠️ Your Ambiguities Section Name]\n", + "[⚠️ How should unclear requirements be flagged?]\n", + "\n", + "## [⚠️ Your Coverage Gaps Section Name]\n", + "[⚠️ What information helps identify what's missing?]\n", + "\n", + "## [⚠️ Your Unit Tests Section Name]\n", + "[⚠️ How should individual unit tests be structured?]\n", + "\n", + "### Test: [Descriptive Name]\n", + "**Purpose:** [⚠️ Define what makes a good purpose statement]\n", + "**Preconditions:** [⚠️ What setup information is needed?]\n", + "**Steps:** [⚠️ How detailed should steps be?]\n", + "**Expected:** [⚠️ What makes expectations clear and testable?]\n", + "\n", + "## [⚠️ Your Integration Tests Section Name]\n", + "[⚠️ Similar structure to unit tests, but what distinguishes integration tests?]\n", + "\n", + "## [⚠️ Your Infrastructure Needs Section Name]\n", + "[⚠️ What test dependencies should be documented?]\n", + "\n", + "\"\"\"\n", + " }\n", + "]\n", + "\n", + "#═══════════════════════════════════════════════════════════════════════════════\n", + "# πŸ§ͺ TEST YOUR TEMPLATE - Uncomment when you've completed all TODO sections\n", + "#═══════════════════════════════════════════════════════════════════════════════\n", + "\n", + "# print(\"πŸ§ͺ TESTING YOUR TEST GENERATION TEMPLATE\")\n", + "# print(\"=\"*70)\n", + "# discount_test_result = get_chat_completion(discount_test_messages, temperature=0.0)\n", + "# print(discount_test_result)\n", + "# print(\"=\"*70)\n", + "\n", + "#═══════════════════════════════════════════════════════════════════════════════\n", + "# HINTS & GUIDANCE\n", + "#═══════════════════════════════════════════════════════════════════════════════\n", + "\n", + "print(\"\"\"\n", + "╔══════════════════════════════════════════════════════════════════════════════╗\n", + "β•‘ πŸ’‘ HINTS FOR SUCCESS β•‘\n", + "β•šβ•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•\n", + "\n", + "πŸ“‹ TEST GENERATION ELEMENTS TO INCLUDE:\n", + " βœ“ Ambiguity detection: Identify unclear or missing requirements\n", + " βœ“ Coverage dimensions: Happy paths, edge cases, error paths, business rules\n", + " βœ“ Test categorization: Clear distinction between unit and integration tests\n", + " βœ“ Comprehensive specs: Purpose, preconditions, steps, expected outcomes\n", + " βœ“ Infrastructure flagging: Mocks, fixtures, test data requirements\n", + "\n", + "πŸ€” AMBIGUITIES TO IDENTIFY IN THIS DISCOUNT SYSTEM:\n", + " β€’ What happens if discount code is expired? (error message? silent fail?)\n", + " β€’ Are discount codes case-sensitive? (SAVE10 vs save10)\n", + " β€’ What if fixed discount > cart total? (set to $0? reject?)\n", + " β€’ Can percentage be 0%? 100%? Over 100%?\n", + " β€’ How are percentages rounded? 
(0.5 rounds up or down?)\n", + " β€’ Race condition: Multiple users applying one-time-use code simultaneously?\n", + " β€’ Minimum cart value requirement before discount?\n", + " β€’ What if cart is empty when discount is applied?\n", + "\n", + "πŸ§ͺ EDGE CASES TO COVER:\n", + " β€’ Boundary values: 0%, 1%, 99%, 100% discounts\n", + " β€’ Minimum amounts: $0.01 cart, $0.01 discount\n", + " β€’ Maximum amounts: Very large cart values, very large discounts\n", + " β€’ Expiration: Code expires today (time zone handling?)\n", + " β€’ Usage limits: Exactly at usage limit vs over limit\n", + " β€’ Empty/null/invalid inputs: Missing codes, special characters\n", + "\n", + "❓ SELF-CHECK QUESTIONS:\n", + " β†’ Does my template request ambiguity identification?\n", + " β†’ Does it cover all test dimensions (happy/edge/error/business)?\n", + " β†’ Are test specifications detailed enough to implement?\n", + " β†’ Is the output format clear and actionable for developers?\n", + " β†’ Does it flag infrastructure needs (mock time for expiration tests)?\n", + "\n", + "╔══════════════════════════════════════════════════════════════════════════════╗\n", + "β•‘ 🎯 NEXT CHALLENGES β•‘\n", + "β•šβ•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•\n", + "\n", + "After creating your template, extend your learning:\n", + " \n", + " 1️⃣ Test it on different features (user authentication, payment processing)\n", + " 2️⃣ Create specialized variants:\n", + " β€’ API testing template (focus on contracts, versioning)\n", + " β€’ Security testing template (focus on auth, input validation)\n", + " 3️⃣ Compare with the complete example in Cell 20\n", + " β€’ What did you do similarly? Differently?\n", + " β€’ Which approach generates more comprehensive tests?\n", + "\n", + "πŸ“š REFERENCE CELLS:\n", + " β€’ Cell 18: General test generation template with markdown output\n", + " β€’ Cell 20: Complete working example with payment service\n", + " β€’ Cell 33: The 3-step process for this practice activity\n", + "\"\"\")\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "---\n", + "\n", + "### πŸ“š Learn More: Advanced Test Generation Patterns\n", + "\n", + "Want to dive deeper into AI-powered test automation? 
Explore these resources:\n", + "\n", + "**πŸ“– AWS Anthropic Advanced Patterns**\n", + "- [Test Generation Command Pattern](https://github.com/aws-samples/anthropic-on-aws/blob/main/advanced-claude-code-patterns/commands/generate-tests.md) - Production-ready patterns for automated test generation\n", + "- Covers advanced topics like test data generation, coverage analysis, and CI/CD integration\n", + "\n", + "**πŸ”— Related Best Practices**\n", + "- [Claude 4 Prompt Engineering Best Practices](https://docs.claude.com/en/docs/build-with-claude/prompt-engineering/claude-4-best-practices) - Core prompting techniques\n", + "- [Prompt Templates and Variables](https://docs.claude.com/en/docs/build-with-claude/prompt-engineering/prompt-templates-and-variables) - Parameterization strategies\n", + "\n", + "**πŸ’‘ What You Can Build Next:**\n", + "- Integrate this template into your CI/CD pipeline for automatic test generation\n", + "- Create specialized variants (API testing, UI testing, security testing)\n", + "- Build a test coverage analyzer that suggests missing test scenarios\n", + "- Develop test data generators for edge case validation\n", + "\n", + "
    \n", + "🎯 Pro Tip: The AWS patterns repository includes examples of integrating test generation with AWS Lambda and CodeBuild. Perfect for automating test creation in your deployment pipeline!\n", + "
    \n", + "\n", + "---\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "---\n", + "\n", + "
    \n", + "
    \n", + "

    🍡 Suggested Break Point #2

    \n", + "

    ~75 minutes elapsed β€’ Halfway through!

    \n", + "
    \n", + " \n", + "
    \n", + "

    βœ… Completed (Sections 1-2):

    \n", + "
      \n", + "
    • Code Review Automation Template
    • \n", + "
    • Test Generation Automation Template
    • \n", + "
    • Production-ready template structures with markdown output
    • \n", + "
    • Hands-on practice building code review and test generation templates
    • \n", + "
    \n", + "

    🎯 You've completed 2 out of 4 sections!

    \n", + "
    \n", + " \n", + "
    \n", + "

    ⏭️ Coming Next:

    \n", + "
      \n", + "
    • Section 3: LLM-as-Judge Evaluation Rubric
    • \n", + "
    • Validating AI-generated outputs
    • \n", + "
    • Quality gates and automated QA
    • \n", + "
    \n", + "

    ⏱️ Next section: ~25-30 minutes

    \n", + "
    \n", + " \n", + "
    \n", + "

    πŸ“Œ BOOKMARK TO RESUME:

    \n", + "

    \"Section 3: LLM-as-Judge Evaluation Rubric\"

    \n", + "
    \n", + " \n", + "

    \n", + " πŸ’‘ Great progress! Consider taking a break before continuing with quality assurance.\n", + "

    \n", + "
    \n", + "\n", + "---\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## βš–οΈ Section 3: LLM-as-Judge Evaluation Rubric\n", + "\n", + "### Validating AI-Generated Outputs\n", + "\n", + "**The Quality Challenge:**\n", + "\n", + "When AI generates code reviews or test plans, how do you know if they're good?\n", + "\n", + "- ❓ **Trust issue** - Can we rely on AI feedback?\n", + "- πŸ“Š **Consistency** - Does quality vary between runs?\n", + "- 🎯 **Standards** - Does output meet team expectations?\n", + "\n", + "**Solution: LLM-as-Judge Pattern**\n", + "\n", + "Use a second AI call with a structured rubric to evaluate the first AI's output. Think of it as automated peer review!\n", + "\n", + "#### πŸ”„ The Workflow\n", + "\n", + "```\n", + "1. AI Generator β†’ Produces code review / test plan\n", + "2. LLM-as-Judge β†’ Evaluates quality against rubric \n", + "3. Decision β†’ Accept / Request revision / Reject\n", + "```\n", + "\n", + "**Benefits:**\n", + "- βœ… **Automated QA** - No human review needed for every AI output\n", + "- πŸ“Š **Objective scoring** - Rubric provides consistent evaluation\n", + "- πŸ” **Transparency** - Shows why output passed or failed\n", + "- πŸ”„ **Feedback loop** - Low scores trigger regeneration with improvements\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### πŸ“‹ LLM-as-Judge Rubric Template\n", + "\n", + "```xml\n", + "\n", + "You are a Principal Engineer reviewing AI-generated code feedback.\n", + "\n", + "\n", + "\n", + "1. Accuracy (40%): Do the identified issues/tests align with the actual code/requirements?\n", + "2. Completeness (30%): Are major concerns covered? Are tests covering edge cases?\n", + "3. Actionability (20%): Are remediation steps clear and feasible?\n", + "4. Communication (10%): Is tone professional and structure clear?\n", + "\n", + "\n", + "\n", + "Score each criterion 1-5 with detailed rationale:\n", + "- 5: Excellent - Exceeds expectations\n", + "- 4: Good - Meets expectations with minor gaps\n", + "- 3: Acceptable - Meets minimum bar\n", + "- 2: Needs work - Significant gaps\n", + "- 1: Unacceptable - Fails to meet standards\n", + "\n", + "Calculate weighted total score.\n", + "Recommend:\n", + "- ACCEPT (β‰₯3.5): Production-ready\n", + "- REVISE (2.5-3.4): Needs improvements, provide specific guidance\n", + "- REJECT (<2.5): Start over with different approach\n", + "\n", + "\n", + "\n", + "{{llm_output_under_review}}\n", + "\n", + "\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " ACCEPT/REVISE/REJECT\n", + " \n", + "\n", + "\n", + "```\n", + "\n", + "#### πŸ”‘ Rubric Design Principles\n", + "\n", + "1. **Weighted Criteria** - Most important aspects (accuracy) weighted highest\n", + "2. **Explicit Scale** - 1-5 with clear definitions prevents ambiguity\n", + "3. **Evidence Required** - Rationale forces specific justification\n", + "4. **Actionable Thresholds** - Clear cut-offs (3.5, 2.5) for decisions\n", + "5. 
**Improvement Guidance** - \"REVISE\" verdict includes specific feedback\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### πŸ’» Working Example: Evaluating an AI Code Review\n", + "\n", + "Let's evaluate the quality of an AI-generated code review using our judge rubric.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Example: Evaluating an AI-Generated Code Review\n", + "\n", + "# Step 1: Generate a code review (simulated - you could use the earlier example)\n", + "sample_code = \"\"\"\n", + "def calculate_discount(price, discount_percent):\n", + " return price - (price * discount_percent / 100)\n", + "\"\"\"\n", + "\n", + "# Simulated AI review (normally this would come from get_chat_completion)\n", + "ai_generated_review = \"\"\"\n", + "\n", + " \n", + " Analyzing the discount calculation function. The logic appears straightforward but \n", + " I should check for edge cases: negative values, values > 100, type handling, \n", + " and potential precision issues with floating point arithmetic.\n", + " \n", + " \n", + " \n", + " \n", + " major\n", + " No input validation for discount_percent\n", + " Function accepts any numeric value. discount_percent > 100 would result in negative price.\n", + " Add validation: if not 0 <= discount_percent <= 100: raise ValueError(\"Discount must be between 0 and 100\")\n", + " \n", + " \n", + " \n", + " minor\n", + " No type hints\n", + " Parameters lack type annotations, making the expected types unclear.\n", + " Add type hints: def calculate_discount(price: float, discount_percent: float) -> float:\n", + " \n", + " \n", + " \n", + " nit\n", + " Missing docstring\n", + " Function lacks documentation explaining parameters and return value.\n", + " Add docstring with parameter descriptions and example usage.\n", + " \n", + " \n", + " \n", + " NEEDS REVISION\n", + " Function has correct core logic but lacks input validation which could lead to runtime bugs. Adding validation and type hints would make it production-ready.\n", + "\n", + "\"\"\"\n", + "\n", + "# Step 2: Evaluate with LLM-as-Judge\n", + "judge_messages = [\n", + " {\n", + " \"role\": \"system\",\n", + " \"content\": \"You are a Principal Engineer reviewing AI-generated code feedback.\"\n", + " },\n", + " {\n", + " \"role\": \"user\",\n", + " \"content\": f\"\"\"\n", + "\n", + "Original code under review:\n", + "{sample_code}\n", + "\n", + "AI-generated review to evaluate:\n", + "\n", + "\n", + "\n", + "1. Accuracy (40%): Do identified issues actually exist and are correctly described?\n", + "2. Completeness (30%): Are major concerns covered? Any critical issues missed?\n", + "3. Actionability (20%): Are recommendations specific and implementable?\n", + "4. 
Communication (10%): Is the review professional, clear, and well-structured?\n", + "\n", + "\n", + "\n", + "Score each criterion 1-5 with detailed rationale.\n", + "Calculate weighted total: (AccuracyΓ—0.4) + (CompletenessΓ—0.3) + (ActionabilityΓ—0.2) + (CommunicationΓ—0.1)\n", + "Recommend:\n", + "- ACCEPT (β‰₯3.5): Production-ready\n", + "- REVISE (2.5-3.4): Needs improvements \n", + "- REJECT (<2.5): Unacceptable quality\n", + "\n", + "\n", + "\n", + "{ai_generated_review}\n", + "\n", + "\n", + "\n", + "Provide structured evaluation with scores, weighted total, recommendation, and specific feedback.\n", + "\n", + "\"\"\"\n", + " }\n", + "]\n", + "\n", + "print(\"βš–οΈ JUDGE EVALUATION IN PROGRESS...\")\n", + "print(\"=\"*70)\n", + "judge_result = get_chat_completion(judge_messages, temperature=0.0)\n", + "print(judge_result)\n", + "print(\"=\"*70)\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### πŸ“Š Why LLM-as-Judge Is Powerful\n", + "\n", + "**1. Automated Quality Gate**\n", + "```python\n", + "if weighted_score >= 3.5:\n", + " # Auto-approve and use the AI review\n", + " post_review_comment(ai_generated_review)\n", + "elif weighted_score >= 2.5:\n", + " # Trigger regeneration with feedback\n", + " regenerate_review(with_guidance=judge_feedback)\n", + "else:\n", + " # Fallback to human review\n", + " notify_human_reviewer()\n", + "```\n", + "\n", + "**2. Consistent Standards**\n", + "- Rubric encodes team expectations\n", + "- Same criteria applied every time\n", + "- Reduces reviewer bias\n", + "\n", + "**3. Continuous Improvement**\n", + "- Low scores β†’ Prompt refinement\n", + "- Track score trends over time\n", + "- A/B test different prompt versions\n", + "\n", + "**4. Transparency & Trust**\n", + "- Shows reasoning for accept/reject\n", + "- Teams can audit decisions\n", + "- Builds confidence in AI-assisted workflows\n", + "\n", + "#### 🎯 Real-World Use Cases\n", + "\n", + "| Application | Implementation |\n", + "|-------------|----------------|\n", + "| **CI/CD Pipeline** | Code review β†’ Judge eval β†’ Auto-comment if score > 3.5 |\n", + "| **Test Plan Validation** | Generate tests β†’ Judge completeness β†’ Flag gaps |\n", + "| **Documentation Review** | AI writes docs β†’ Judge clarity β†’ Request revisions |\n", + "| **Prompt Engineering** | Compare prompts β†’ Judge outputs β†’ Pick best version |\n", + "\n", + "#### πŸ”§ Customization Tips\n", + "\n", + "**Adjust weights for your context:**\n", + "```python\n", + "# Security-focused team\n", + "Accuracy: 50%, Completeness: 30%, Actionability: 15%, Communication: 5%\n", + "\n", + "# DevRel/Documentation team \n", + "Communication: 40%, Actionability: 30%, Accuracy: 20%, Completeness: 10%\n", + "\n", + "# Fast-moving startup\n", + "Actionability: 50%, Accuracy: 30%, Completeness: 15%, Communication: 5%\n", + "```\n", + "\n", + "**Add domain-specific criteria:**\n", + "- **Performance Review**: \"Does review mention Big-O complexity?\"\n", + "- **Security Review**: \"Are OWASP Top 10 risks addressed?\"\n", + "- **API Review**: \"Are breaking changes clearly flagged?\"\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "---\n", + "\n", + "
    \n", + "
    \n", + "

    πŸ§ƒ Suggested Break Point #3

    \n", + "

    ~105 minutes elapsed β€’ Almost there!

    \n", + "
    \n", + " \n", + "
    \n", + "

    βœ… Completed (Sections 1-3):

    \n", + "
      \n", + "
    • Code Review Template with Decomposition + CoT
    • \n", + "
    • Test Case Generation with Coverage Analysis
    • \n", + "
    • LLM-as-Judge Evaluation Rubric
    • \n", + "
    • Quality gates and automated validation
    • \n", + "
    \n", + "

    🎯 You've completed 3 out of 4 sections!

    \n", + "
    \n", + " \n", + "
    \n", + "

    ⏭️ Final Sprint:

    \n", + "
      \n", + "
    • Hands-On Practice Activities (4 exercises)
    • \n", + "
    • Comprehensive code review across multiple dimensions
    • \n", + "
    • Test generation for ambiguous requirements
    • \n", + "
    • Template customization and quality evaluation
    • \n", + "
    \n", + "

    ⏱️ Remaining time: ~40-50 minutes

    \n", + "
    \n", + " \n", + "
    \n", + "

    πŸ“Œ BOOKMARK TO RESUME:

    \n", + "

    \"Hands-On Practice Activities\"

    \n", + "
    \n", + " \n", + "

    \n", + " πŸ’‘ You're in the home stretch! Take a quick break before the practice exercises.\n", + "

    \n", + "
    \n", + "\n", + "---\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## πŸ‹οΈ Hands-On Practice Activities\n", + "\n", + "### Activity 3.2: Comprehensive Code Review Template\n", + "\n", + "**Goal:** Create a template for comprehensive code review across multiple dimensions.\n", + "\n", + "**Scenario:** Your team needs automated code reviews for all API changes. Build a prompt template that evaluates:\n", + "- Security (authentication, input validation, common vulnerabilities)\n", + "- Performance (query optimization, algorithm efficiency)\n", + "- Code Quality (readability, maintainability, error handling)\n", + "- Best Practices (language idioms, design patterns)\n", + "\n", + "**Your Task:**\n", + "1. Adapt the code review template with comprehensive review guidelines\n", + "2. Test it on the API endpoint code below\n", + "3. Evaluate: Did it catch issues across multiple dimensions?\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Activity 3.2: Security Code Review\n", + "\n", + "security_code = \"\"\"\n", + "+ @app.route('/api/user//profile', methods=['GET', 'POST'])\n", + "+ def user_profile(user_id):\n", + "+ if request.method == 'POST':\n", + "+ # Update user profile\n", + "+ data = request.get_json()\n", + "+ query = f\"UPDATE users SET bio='{data['bio']}', website='{data['website']}' WHERE id={user_id}\"\n", + "+ db.execute(query)\n", + "+ \n", + "+ # Store uploaded avatar\n", + "+ if 'avatar' in request.files:\n", + "+ file = request.files['avatar']\n", + "+ file.save(f'/uploads/{file.filename}')\n", + "+ \n", + "+ return jsonify({\"message\": \"Profile updated\"})\n", + "+ \n", + "+ # Get user profile\n", + "+ user = db.query(f\"SELECT * FROM users WHERE id={user_id}\").fetchone()\n", + "+ return jsonify(user)\n", + "\"\"\"\n", + "\n", + "# TODO: Build your security review template\n", + "# Hints:\n", + "# - Role: \"Senior Security Engineer\" or \"Application Security Specialist\"\n", + "# - Guidelines: Check for SQL injection, path traversal, missing auth, XSS\n", + "# - Focus areas: Input validation, authentication, file upload security\n", + "# - Severity: Use security-specific levels (Critical/High/Medium/Low)\n", + "\n", + "security_review_messages = [\n", + " {\n", + " \"role\": \"system\",\n", + " \"content\": \"You are a Senior Application Security Engineer specializing in web API security.\"\n", + " },\n", + " {\n", + " \"role\": \"user\",\n", + " \"content\": f\"\"\"\n", + "\n", + "Repository: user-api-service \n", + "Endpoint: User Profile Management (new endpoint)\n", + "Security Focus: OWASP Top 10, authentication, input validation\n", + "\n", + "\n", + "\n", + "{security_code}\n", + "\n", + "\n", + "\n", + "1. Check for OWASP Top 10 vulnerabilities (SQL injection, XSS, broken auth, etc.)\n", + "2. Verify authentication and authorization mechanisms\n", + "3. Assess input validation and sanitization\n", + "4. Review file upload handling for path traversal\n", + "5. Check for sensitive data exposure\n", + "6. 
Cite exact lines with CVE/CWE references where applicable\n", + "\n", + "\n", + "\n", + "Step 1 - Think: In tags, identify security vulnerabilities.\n", + "Step 2 - Assess: For each issue, provide:\n", + " β€’ Severity (critical/high/medium/low)\n", + " β€’ Vulnerability type (SQL injection, etc.)\n", + " β€’ Evidence (line numbers, attack vector)\n", + " β€’ CVE/CWE reference if applicable\n", + "Step 3 - Suggest: Provide secure code alternatives.\n", + "Step 4 - Verdict: Security assessment (block/requires-fixes/approve-with-notes).\n", + "\n", + "\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "\n", + "\n", + "\"\"\"\n", + " }\n", + "]\n", + "\n", + "print(\"πŸ”’ SECURITY REVIEW - Activity 3.2\")\n", + "print(\"=\"*70)\n", + "security_result = get_chat_completion(security_review_messages, temperature=0.0)\n", + "print(security_result)\n", + "print(\"=\"*70)\n", + "print(\"\\nπŸ’‘ Expected findings:\")\n", + "print(\" - SQL Injection (Critical) - f-string query construction\")\n", + "print(\" - Path Traversal (High) - Unsafe file.filename usage\")\n", + "print(\" - Missing Authentication (Critical) - No auth check on endpoint\")\n", + "print(\" - Potential XSS (Medium) - Unvalidated user data returned\")\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### βœ… Solution Analysis\n", + "\n", + "**Key Best Practices Demonstrated:**\n", + "\n", + "1. **Multi-Dimensional Role** - Uses \"Senior Software Engineer\" with broad expertise (security, performance, quality)\n", + "2. **Balanced Review Guidelines** - Covers security, performance, maintainability, and best practices\n", + "3. **Clear Categories** - Categorizes findings (Security / Performance / Quality / Correctness)\n", + "4. **Practical Severity** - Uses CRITICAL/MAJOR/MINOR based on impact across all dimensions\n", + "5. **Actionable Feedback** - Provides concrete fixes and recommendations\n", + "\n", + "**Expected Findings:**\n", + "- βœ… SQL Injection - f-string query construction (Security)\n", + "- βœ… Path Traversal - Unsafe file.filename usage (Security)\n", + "- βœ… Missing Authentication - No auth decorator (Security)\n", + "- βœ… Poor Error Handling - Potential XSS in responses (Quality/Security)\n", + "\n", + "**πŸ“– Full Solution:** See [solutions/activity-3.2-code-review-solution.md](solutions/activity-3.2-code-review-solution.md) for:\n", + "- Detailed analysis of each best practice\n", + "- Production CI/CD integration examples\n", + "- Customization patterns for different tech stacks and contexts\n", + "- Metrics for tracking template effectiveness\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Activity 3.3: Template Customization Challenge\n", + "\n", + "**Goal:** Customize prompt templates for your team's specific needs.\n", + "\n", + "**Scenario:** Different teams have different review standards. 
Adapt the base template for various contexts.\n", + "\n", + "**Your Task:** Choose one and implement it:\n", + "\n", + "**Option A: Performance-Focused Review**\n", + "- Role: \"Senior Performance Engineer\"\n", + "- Focus: Big-O complexity, caching, database query optimization, memory usage\n", + "- Test on: A function with nested loops or N+1 query problem\n", + "\n", + "**Option B: DevOps/SRE Review** \n", + "- Role: \"Site Reliability Engineer\"\n", + "- Focus: Observability (logging, metrics, tracing), error handling, graceful degradation\n", + "- Test on: A service initialization function\n", + "\n", + "**Option C: API Design Review**\n", + "- Role: \"API Architect\" \n", + "- Focus: RESTful conventions, versioning, backward compatibility, error responses\n", + "- Test on: A new API endpoint design\n", + "\n", + "Pick one and build it below!\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### βœ… Solution Analysis\n", + "\n", + "**Key Best Practices Demonstrated:**\n", + "\n", + "1. **Domain-Specific Role** - \"Performance Engineer\" not generic \"Engineer\"\n", + "2. **Scale Context** - \"Must handle 1000+ posts\" sets clear performance bar\n", + "3. **Quantified Analysis** - Big-O notation, query counts, latency estimates (not vague \"slow\")\n", + "4. **Before/After Metrics** - Shows improvement: 100s β†’ 0.05s (2000x faster!)\n", + "5. **Actionable Optimizations** - Provides exact code for the fix\n", + "\n", + "**Expected Findings:**\n", + "- βœ… N+1 Query Problem - 2001 database queries (1 user + 1000 posts + 1000 likes)\n", + "- βœ… Complexity: O(n) queries with network latency = 100 seconds for 1000 posts\n", + "- βœ… Solution: Single join query reduces to O(1) = 0.05 seconds\n", + "- βœ… Additional opportunities: Caching, pagination, indexing\n", + "\n", + "**Adaptation Pattern:**\n", + "\n", + "| Domain | Role | Focus | Output Metrics |\n", + "|--------|------|-------|----------------|\n", + "| **Performance** | Performance Engineer | Big-O, N+1, caching | Latency, query counts |\n", + "| **SRE** | Site Reliability Engineer | Logging, metrics, resilience | Observability gaps |\n", + "| **API Design** | API Architect | REST, versioning | Breaking changes |\n", + "\n", + "**Key Takeaway:** Same template structure, different expertise area!\n", + "\n", + "**πŸ“– Full Solution:** See [solutions/activity-3.3-customization-solution.md](solutions/activity-3.3-customization-solution.md) for:\n", + "- Complete N+1 query analysis and optimized code\n", + "- Full adaptation patterns for SRE, API design, React\n", + "- When to create domain-specific templates\n", + "- Step-by-step customization strategy\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Activity 3.3: Template Customization\n", + "\n", + "# Example: Performance-Focused Review\n", + "perf_code = \"\"\"\n", + "+ def get_user_posts_with_likes(user_id):\n", + "+ user = User.query.get(user_id)\n", + "+ posts = []\n", + "+ for post_id in user.post_ids:\n", + "+ post = Post.query.get(post_id)\n", + "+ like_count = Like.query.filter_by(post_id=post.id).count()\n", + "+ post.likes = like_count\n", + "+ posts.append(post)\n", + "+ return posts\n", + "\"\"\"\n", + "\n", + "# TODO: Customize for YOUR chosen focus area\n", + "# This example shows performance review - adapt for your choice!\n", + "\n", + "custom_messages = [\n", + " {\n", + " \"role\": \"system\",\n", + " \"content\": \"You are a Senior Performance Engineer 
specializing in database optimization.\"\n", + " },\n", + " {\n", + " \"role\": \"user\",\n", + " \"content\": f\"\"\"\n", + "\n", + "Repository: social-media-api\n", + "Function: get_user_posts_with_likes\n", + "Performance Requirements: Must handle users with 1000+ posts efficiently\n", + "\n", + "\n", + "\n", + "{perf_code}\n", + "\n", + "\n", + "\n", + "1. Analyze algorithmic complexity (Big-O notation)\n", + "2. Identify N+1 query problems\n", + "3. Check for caching opportunities\n", + "4. Assess memory usage patterns\n", + "5. Recommend performance optimizations\n", + "6. Estimate performance impact with data size\n", + "\n", + "\n", + "\n", + "Step 1 - Think: In , analyze time/space complexity and identify bottlenecks.\n", + "Step 2 - Assess: For each issue:\n", + " β€’ Severity (critical/high/medium/low based on performance impact)\n", + " β€’ Complexity analysis (O(n), O(nΒ²), etc.)\n", + " β€’ Evidence (specific operations causing slowdown)\n", + " β€’ Performance impact estimate\n", + "Step 3 - Suggest: Provide optimized code with complexity improvement.\n", + "Step 4 - Verdict: Performance rating and estimated improvement.\n", + "\n", + "\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "\n", + "\n", + "\"\"\"\n", + " }\n", + "]\n", + "\n", + "print(\"⚑ CUSTOM TEMPLATE - Activity 3.3 (Performance Review)\")\n", + "print(\"=\"*70)\n", + "custom_result = get_chat_completion(custom_messages, temperature=0.0)\n", + "print(custom_result)\n", + "print(\"=\"*70)\n", + "print(\"\\nπŸ’‘ Adaptation tips:\")\n", + "print(\" - Changed role to match domain\")\n", + "print(\" - Added domain-specific guidelines (Big-O, N+1)\")\n", + "print(\" - Modified severity to reflect performance impact\")\n", + "print(\" - Customized output format for complexity analysis\")\n", + "print(\"\\n Try adapting this for SRE or API Design review!\")\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Activity 3.4: Quality Evaluation with LLM-as-Judge\n", + "\n", + "**Goal:** Build an automated quality gate for AI-generated outputs.\n", + "\n", + "**Scenario:** You're implementing automated code reviews in your CI/CD pipeline. Before posting AI reviews to PRs, you want quality assurance.\n", + "\n", + "**Your Task:**\n", + "1. Generate a code review (use any previous example or create new one)\n", + "2. Create an LLM-as-Judge rubric for your team's standards\n", + "3. Evaluate the review and decide: Accept / Revise / Reject\n", + "4. Reflection: Would this catch low-quality AI outputs?\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### βœ… Solution Analysis\n", + "\n", + "**Key Best Practices Demonstrated:**\n", + "\n", + "1. **Intentionally Low-Quality Input** - Tests judge with vague review (\"Make it better\")\n", + "2. **Context-Specific Rubric** - Weights match CI/CD needs (Specificity 40%, Actionability 30%)\n", + "3. **Clear Scoring Scale** - 1-5 with explicit definitions (not subjective \"good/bad\")\n", + "4. **Actionable Thresholds** - β‰₯4.0 = accept, 2.5-3.9 = revise, <2.5 = reject\n", + "5. 
**Improvement Loop** - Provides specific feedback for regeneration\n", + "\n", + "**Expected Judge Scores:**\n", + "\n", + "```\n", + "Specificity: 1/5 ❌ - \"Function could be improved\" is vague\n", + "Actionability: 1/5 ❌ - \"Make it better\" not actionable\n", + "Technical Accuracy: 2/5 ⚠️ - Can't verify without specifics\n", + "Completeness: 2/5 ⚠️ - Only 1 issue found, likely incomplete\n", + "\n", + "Weighted Total: (1Γ—0.4) + (1Γ—0.3) + (2Γ—0.2) + (2Γ—0.1) = 1.3\n", + "Decision: REJECT (<2.5) 🚫\n", + "```\n", + "\n", + "**Production Workflow:**\n", + "\n", + "```python\n", + "if score >= 4.0:\n", + " post_to_pr(review) # Auto-approve\n", + "elif score >= 2.5:\n", + " regenerate_with_feedback() # Retry with guidance\n", + "else:\n", + " flag_for_human_review() # Fallback\n", + "```\n", + "\n", + "**Why This Matters:**\n", + "- Prevents vague AI outputs from reaching users\n", + "- Builds trust through consistent quality\n", + "- Enables true automation (not just \"AI suggestion\")\n", + "\n", + "**πŸ“– Full Solution:** See [solutions/activity-3.4-judge-solution.md](solutions/activity-3.4-judge-solution.md) for:\n", + "- Complete judge evaluation breakdown\n", + "- Production quality gate implementation with retry logic\n", + "- Monitoring dashboard examples\n", + "- Success metrics to track (acceptance rate, cost per review)\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Activity 3.4: Build Your Own Judge\n", + "\n", + "# Sample AI-generated output to evaluate (simulated)\n", + "ai_output_to_judge = \"\"\"\n", + "\n", + " \n", + " \n", + " medium\n", + " Function could be improved\n", + " The code is not optimal\n", + " Make it better\n", + " \n", + " \n", + " NEEDS WORK\n", + " Some issues found\n", + "\n", + "\"\"\"\n", + "\n", + "# TODO: This is a LOW QUALITY review (vague, no specifics)\n", + "# Build a judge that catches this!\n", + "\n", + "judge_eval_messages = [\n", + " {\n", + " \"role\": \"system\",\n", + " \"content\": \"You are a Principal Engineer evaluating AI-generated code reviews for your team's CI/CD pipeline.\"\n", + " },\n", + " {\n", + " \"role\": \"user\",\n", + " \"content\": f\"\"\"\n", + "\n", + "Your team is implementing automated code reviews. Reviews must meet high standards before being posted to PRs.\n", + "\n", + "\n", + "\n", + "1. Specificity (40%): Are issues concrete with exact evidence (line numbers, code snippets)?\n", + "2. Actionability (30%): Can developer immediately act on recommendations?\n", + "3. Technical Accuracy (20%): Are the issues technically sound?\n", + "4. 
Completeness (10%): Are major categories covered (security, performance, correctness)?\n", + "\n", + "\n", + "\n", + "Score each criterion 1-5:\n", + "- 5: Excellent - Ready for production\n", + "- 4: Good - Minor improvements needed\n", + "- 3: Acceptable - Meets minimum bar\n", + "- 2: Poor - Significant issues, needs revision\n", + "- 1: Unacceptable - Reject and regenerate\n", + "\n", + "Calculate weighted score.\n", + "Provide specific feedback for scores < 4.\n", + "\n", + "Decision thresholds:\n", + "- ACCEPT (β‰₯4.0): Post to PR\n", + "- REVISE (2.5-3.9): Regenerate with specific guidance\n", + "- REJECT (<2.5): Discard, use different approach\n", + "\n", + "\n", + "\n", + "{ai_output_to_judge}\n", + "\n", + "\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " ACCEPT/REVISE/REJECT\n", + " Specific guidance for improvement\n", + "\n", + "\n", + "\"\"\"\n", + " }\n", + "]\n", + "\n", + "print(\"βš–οΈ QUALITY EVALUATION - Activity 3.4\")\n", + "print(\"=\"*70)\n", + "print(\"Evaluating this AI-generated review:\")\n", + "print(ai_output_to_judge)\n", + "print(\"\\n\" + \"=\"*70)\n", + "judge_eval_result = get_chat_completion(judge_eval_messages, temperature=0.0)\n", + "print(judge_eval_result)\n", + "print(\"=\"*70)\n", + "print(\"\\nπŸ’‘ This review is intentionally vague. Your judge should:\")\n", + "print(\" - Give low scores for Specificity (no line numbers)\")\n", + "print(\" - Give low scores for Actionability ('make it better' is useless)\")\n", + "print(\" - Recommend REVISE or REJECT\")\n", + "print(\"\\n If your judge caught these issues, it's working! βœ…\")\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### πŸ“¦ Reference Implementation: Production-Ready Template Library\n", + "\n", + "Below is a complete, copy-paste ready implementation that demonstrates all best practices from this module.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Production-Ready Template Library\n", + "# Copy this to your project and customize for your team\n", + "\n", + "from typing import Dict, List, Any, Optional\n", + "from dataclasses import dataclass\n", + "from enum import Enum\n", + "\n", + "# ============================================\n", + "# 1. 
TEMPLATE DEFINITIONS\n", + "# ============================================\n", + "\n", + "class ReviewTemplate:\n", + " \"\"\"Base template for code reviews with parameterization\"\"\"\n", + " \n", + " @staticmethod\n", + " def code_review_template(\n", + " tech_stack: str = \"Python microservices\",\n", + " repo_name: str = \"{{repo_name}}\",\n", + " service_name: str = \"{{service_name}}\",\n", + " code_diff: str = \"{{code_diff}}\"\n", + " ) -> List[Dict[str, str]]:\n", + " \"\"\"\n", + " Production-ready code review template.\n", + " \n", + " Args:\n", + " tech_stack: Technology focus (e.g., \"Python microservices\", \"React frontend\")\n", + " repo_name: Repository name for context\n", + " service_name: Service/component name\n", + " code_diff: Git diff to review\n", + " \n", + " Returns:\n", + " Messages array ready for AI completion\n", + " \"\"\"\n", + " return [\n", + " {\n", + " \"role\": \"system\",\n", + " \"content\": f\"You are a Senior Backend Engineer specializing in {tech_stack}.\"\n", + " },\n", + " {\n", + " \"role\": \"user\",\n", + " \"content\": f\"\"\"\n", + "\n", + "Repository: {repo_name}\n", + "Service: {service_name}\n", + "\n", + "\n", + "\n", + "{code_diff}\n", + "\n", + "\n", + "\n", + "1. Highlight issues affecting correctness, security, performance, and maintainability.\n", + "2. Cite exact lines or blocks.\n", + "3. If code is acceptable, confirm with justification.\n", + "\n", + "\n", + "\n", + "Step 1 - Think: In tags, outline potential issues.\n", + "Step 2 - Assess: For each issue, provide severity, description, evidence.\n", + "Step 3 - Suggest: Offer actionable remediation tips.\n", + "Step 4 - Verdict: Conclude with pass/fail and summary.\n", + "\n", + "\n", + "\n", + "\n", + " ...\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "\n", + "\n", + "\"\"\"\n", + " }\n", + " ]\n", + "\n", + " @staticmethod\n", + " def security_review_template(\n", + " repo_name: str = \"{{repo_name}}\",\n", + " code_diff: str = \"{{code_diff}}\"\n", + " ) -> List[Dict[str, str]]:\n", + " \"\"\"Security-focused review template\"\"\"\n", + " return [\n", + " {\n", + " \"role\": \"system\",\n", + " \"content\": \"You are a Senior Application Security Engineer specializing in OWASP Top 10 vulnerabilities.\"\n", + " },\n", + " {\n", + " \"role\": \"user\",\n", + " \"content\": f\"\"\"\n", + "\n", + "Repository: {repo_name}\n", + "Security Focus: OWASP Top 10, authentication, input validation\n", + "\n", + "\n", + "\n", + "{code_diff}\n", + "\n", + "\n", + "\n", + "1. Check for OWASP Top 10 vulnerabilities\n", + "2. Verify authentication and authorization\n", + "3. Assess input validation and sanitization\n", + "4. Check for sensitive data exposure\n", + "5. 
Cite CVE/CWE references where applicable\n", + "\n", + "\n", + "\n", + "Step 1 - Think: In , identify security vulnerabilities.\n", + "Step 2 - Assess: For each issue, provide severity, type, evidence, CWE reference.\n", + "Step 3 - Suggest: Provide secure code alternatives.\n", + "Step 4 - Verdict: Security assessment (block/requires-fixes/approve-with-notes).\n", + "\n", + "\n", + "\n", + "\n", + " \n", + " \n", + " critical/high/medium/low\n", + " vulnerability type\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "\n", + "\n", + "\"\"\"\n", + " }\n", + " ]\n", + "\n", + " @staticmethod\n", + " def test_generation_template(\n", + " tech_stack: str = \"Python/pytest\",\n", + " requirements: str = \"{{requirements}}\",\n", + " existing_tests: str = \"{{existing_tests}}\"\n", + " ) -> List[Dict[str, str]]:\n", + " \"\"\"Test case generation template\"\"\"\n", + " return [\n", + " {\n", + " \"role\": \"system\",\n", + " \"content\": f\"You are a QA Automation Lead with expertise in {tech_stack}.\"\n", + " },\n", + " {\n", + " \"role\": \"user\",\n", + " \"content\": f\"\"\"\n", + "\n", + "{requirements}\n", + "\n", + "\n", + "\n", + "{existing_tests}\n", + "\n", + "\n", + "\n", + "1. Analyze requirements and identify ambiguities.\n", + "2. List coverage gaps in existing tests.\n", + "3. Generate test cases: happy paths, edge cases, error paths, business rules.\n", + "4. Separate unit tests from integration tests.\n", + "5. Flag missing test data or dependencies.\n", + "\n", + "\n", + "\n", + "Provide analysis in tags.\n", + "\n", + "\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " ...\n", + " \n", + "\n", + "\n", + "\"\"\"\n", + " }\n", + " ]\n", + "\n", + " @staticmethod\n", + " def judge_template(\n", + " submission: str,\n", + " criteria_weights: Optional[Dict[str, float]] = None\n", + " ) -> List[Dict[str, str]]:\n", + " \"\"\"LLM-as-Judge evaluation template\"\"\"\n", + " \n", + " if criteria_weights is None:\n", + " criteria_weights = {\n", + " \"Accuracy\": 0.40,\n", + " \"Completeness\": 0.30,\n", + " \"Actionability\": 0.20,\n", + " \"Communication\": 0.10\n", + " }\n", + " \n", + " weights_str = \"\\n\".join([\n", + " f\"{i+1}. {name} ({int(weight*100)}%)\" \n", + " for i, (name, weight) in enumerate(criteria_weights.items())\n", + " ])\n", + " \n", + " return [\n", + " {\n", + " \"role\": \"system\",\n", + " \"content\": \"You are a Principal Engineer evaluating AI-generated outputs for quality.\"\n", + " },\n", + " {\n", + " \"role\": \"user\",\n", + " \"content\": f\"\"\"\n", + "\n", + "{weights_str}\n", + "\n", + "\n", + "\n", + "Score each criterion 1-5 with rationale.\n", + "Calculate weighted total.\n", + "Recommend: ACCEPT (β‰₯3.5), REVISE (2.5-3.4), REJECT (<2.5)\n", + "\n", + "\n", + "\n", + "{submission}\n", + "\n", + "\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " ACCEPT/REVISE/REJECT\n", + " \n", + "\n", + "\n", + "\"\"\"\n", + " }\n", + " ]\n", + "\n", + "\n", + "# ============================================\n", + "# 2. 
WORKFLOW AUTOMATION\n", + "# ============================================\n", + "\n", + "@dataclass\n", + "class ReviewResult:\n", + " \"\"\"Structured review result\"\"\"\n", + " content: str\n", + " score: Optional[float] = None\n", + " verdict: Optional[str] = None\n", + " passed_quality_gate: bool = False\n", + "\n", + "\n", + "def automated_review_workflow(\n", + " code_diff: str,\n", + " repo_name: str,\n", + " quality_threshold: float = 3.5,\n", + " max_retries: int = 2\n", + ") -> ReviewResult:\n", + " \"\"\"\n", + " Complete automated review workflow with quality gate.\n", + " \n", + " This demonstrates best practices:\n", + " - Template parameterization\n", + " - LLM-as-Judge validation\n", + " - Retry logic with feedback\n", + " - Structured output\n", + " \n", + " Args:\n", + " code_diff: Git diff to review\n", + " repo_name: Repository name for context\n", + " quality_threshold: Minimum score to accept (default 3.5)\n", + " max_retries: Maximum regeneration attempts\n", + " \n", + " Returns:\n", + " ReviewResult with content and quality metrics\n", + " \"\"\"\n", + " \n", + " for attempt in range(max_retries + 1):\n", + " try:\n", + " # Step 1: Generate review\n", + " review_messages = ReviewTemplate.code_review_template(\n", + " repo_name=repo_name,\n", + " code_diff=code_diff\n", + " )\n", + " \n", + " review_content = get_chat_completion(review_messages, temperature=0.0)\n", + " if not review_content:\n", + " raise ValueError(\"Review generation returned empty result\")\n", + " \n", + " # Step 2: Evaluate with judge\n", + " judge_messages = ReviewTemplate.judge_template(submission=review_content)\n", + " judge_result = get_chat_completion(judge_messages, temperature=0.0)\n", + " if not judge_result:\n", + " raise ValueError(\"Judge evaluation returned empty result\")\n", + " \n", + " # Step 3: Parse score (simplified - production would use XML parsing)\n", + " # This is a placeholder - implement proper XML parsing\n", + " score = 4.0 # Placeholder\n", + " \n", + " # Step 4: Decision\n", + " if score >= quality_threshold:\n", + " return ReviewResult(\n", + " content=review_content,\n", + " score=score,\n", + " passed_quality_gate=True\n", + " )\n", + " elif attempt < max_retries:\n", + " print(f\"⚠️ Quality score {score} below threshold. Retry {attempt+1}/{max_retries}\")\n", + " continue\n", + " else:\n", + " return ReviewResult(\n", + " content=review_content,\n", + " score=score,\n", + " passed_quality_gate=False\n", + " )\n", + " \n", + " except Exception as e:\n", + " print(f\"❌ Error on attempt {attempt+1}: {e}\")\n", + " if attempt == max_retries:\n", + " return ReviewResult(\n", + " content=f\"Error: {e}\",\n", + " passed_quality_gate=False\n", + " )\n", + " \n", + " return ReviewResult(content=\"Max retries exceeded\", passed_quality_gate=False)\n", + "\n", + "\n", + "# ============================================\n", + "# 3. 
EXAMPLE USAGE\n", + "# ============================================\n", + "\n", + "print(\"πŸ“¦ Production-Ready Template Library Loaded!\")\n", + "print(\"\\nβœ… Available templates:\")\n", + "print(\" - ReviewTemplate.code_review_template()\")\n", + "print(\" - ReviewTemplate.security_review_template()\")\n", + "print(\" - ReviewTemplate.test_generation_template()\")\n", + "print(\" - ReviewTemplate.judge_template()\")\n", + "print(\"\\nβœ… Workflow automation:\")\n", + "print(\" - automated_review_workflow()\")\n", + "print(\"\\nπŸ’‘ Copy this cell to your project and customize!\")\n", + "print(\"\\nπŸ“ Usage example:\")\n", + "print(\"\"\"\n", + "# Basic usage\n", + "messages = ReviewTemplate.code_review_template(\n", + " tech_stack=\"React frontend\",\n", + " repo_name=\"my-app\",\n", + " code_diff=my_diff\n", + ")\n", + "result = get_chat_completion(messages)\n", + "\n", + "# With quality gate\n", + "result = automated_review_workflow(\n", + " code_diff=my_diff,\n", + " repo_name=\"my-app\",\n", + " quality_threshold=4.0\n", + ")\n", + "if result.passed_quality_gate:\n", + " post_to_pr(result.content)\n", + "\"\"\")\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "---\n", + "\n", + "
    \n", + "
    \n", + "

    🎯 Suggested Break Point #4

    \n", + "

    ~145 minutes elapsed β€’ Final section!

    \n", + "
    \n", + " \n", + "
    \n", + "

    βœ… Completed:

    \n", + "
      \n", + "
    • Section 1: Code Review Automation Template
    • \n", + "
    • Section 2: Test Case Generation
    • \n", + "
    • Section 3: LLM-as-Judge Evaluation
    • \n", + "
    • All Hands-On Practice Activities (4 exercises)
    • \n", + "
    • Production-Ready Template Library
    • \n", + "
    \n", + "

    πŸŽ‰ You've completed all core sections and exercises!

    \n", + "
    \n", + " \n", + "
    \n", + "

    ⏭️ Final Topics:

    \n", + "
      \n", + "
    • Section 4: Template Best Practices & Quality Checklist
    • \n", + "
    • Version control and maintenance strategies
    • \n", + "
    • Production deployment guidelines
    • \n", + "
    • CI/CD and automation integration patterns
    • \n", + "
    \n", + "

    ⏱️ Remaining time: ~5-10 minutes (reading)

    \n", + "
    \n", + " \n", + "
    \n", + "

    πŸ“Œ BOOKMARK TO RESUME:

    \n", + "

    \"Section 4: Template Best Practices\"

    \n", + "
    \n", + " \n", + "

    \n", + " πŸ’‘ Nearly done! The final section covers deployment best practices and is mostly reading.\n", + "

    \n", + "
    \n", + "\n", + "---\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## πŸ“Š Section 4: Template Best Practices & Quality Checklist\n", + "\n", + "### Quality Checklist Before Deployment\n", + "\n", + "Before using a prompt template in production, validate it meets these standards:\n", + "\n", + "#### βœ… Role & Context\n", + "- [ ] **Role description** matches task scope and domain expertise\n", + "- [ ] **Expertise level** is appropriate (Junior/Senior/Principal)\n", + "- [ ] **Domain specification** is clear (Backend/Frontend/Security/Performance)\n", + "- [ ] **Context** includes necessary background (repo, service, requirements)\n", + "\n", + "#### βœ… Instructions & Structure\n", + "- [ ] **Tasks decomposed** into explicit, numbered steps\n", + "- [ ] **Required outputs** are clearly specified\n", + "- [ ] **XML/structured tags** used for organization (``, ``, etc.)\n", + "- [ ] **Examples provided** where format is ambiguous\n", + "\n", + "#### βœ… Reasoning & Transparency\n", + "- [ ] **Chain-of-thought** requested for complex analysis\n", + "- [ ] **Inner monologue** tagged if reasoning should be separated from output\n", + "- [ ] **Evidence required** for all claims (line numbers, specific quotes)\n", + "- [ ] **Rationale requested** for subjective decisions\n", + "\n", + "#### βœ… Output Format\n", + "- [ ] **Structured format** defined (XML, JSON, or clear template)\n", + "- [ ] **Severity/priority** levels standardized across team\n", + "- [ ] **Output is parseable** by automation tools if needed\n", + "- [ ] **Format examples** provided in prompt or documentation\n", + "\n", + "#### βœ… Evaluation & Quality\n", + "- [ ] **LLM-as-Judge rubric** defined with weighted criteria\n", + "- [ ] **Acceptance thresholds** established (e.g., score β‰₯ 3.5)\n", + "- [ ] **Failure modes** identified with fallback strategies\n", + "- [ ] **Quality metrics** tracked over time\n", + "\n", + "#### βœ… Parameterization & Reuse\n", + "- [ ] **Variables identified** and marked with `{{placeholders}}`\n", + "- [ ] **Template documented** with parameter descriptions\n", + "- [ ] **Usage examples** provided for team members\n", + "- [ ] **Default values** specified where appropriate\n", + "\n", + "#### βœ… Testing & Validation\n", + "- [ ] **Tested on multiple scenarios** (happy path, edge cases, errors)\n", + "- [ ] **Peer reviewed** by subject matter experts\n", + "- [ ] **Failure cases** tested (what happens with bad input?)\n", + "- [ ] **Performance measured** (latency, token usage, cost)\n", + "\n", + "#### βœ… Team Alignment\n", + "- [ ] **Standards match** team conventions (severity labels, output format)\n", + "- [ ] **Language/tone** appropriate for team culture\n", + "- [ ] **Integration points** defined (CI/CD, IDE, chat tools)\n", + "- [ ] **Feedback mechanism** established for continuous improvement\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### πŸ“ Template Versioning & Maintenance\n", + "\n", + "**Treat prompts like code** - version control and track changes!\n", + "\n", + "#### Version Control Structure\n", + "```\n", + "prompts/\n", + "β”œβ”€β”€ code-review/\n", + "β”‚ β”œβ”€β”€ v1.0-baseline.md\n", + "β”‚ β”œβ”€β”€ v1.1-added-security-focus.md\n", + "β”‚ β”œβ”€β”€ v2.0-restructured-output.md\n", + "β”‚ └── CHANGELOG.md\n", + "β”œβ”€β”€ test-generation/\n", + "β”‚ β”œβ”€β”€ v1.0-baseline.md\n", + "β”‚ └── CHANGELOG.md\n", + "└── llm-as-judge/\n", + " β”œβ”€β”€ code-review-judge-v1.0.md\n", + " └── 
CHANGELOG.md\n", + "```\n", + "\n", + "#### CHANGELOG Example\n", + "```markdown\n", + "## Code Review Template - Changelog\n", + "\n", + "### v2.0 (2024-03-15)\n", + "**Breaking Changes:**\n", + "- Changed output format from plain text to XML\n", + "- Renamed severity levels: blockerβ†’critical, nitβ†’trivial\n", + "\n", + "**Improvements:**\n", + "- Added for reasoning transparency\n", + "- Increased evidence requirement (must cite line numbers)\n", + "- Added performance impact estimation\n", + "\n", + "**Metrics:**\n", + "- LLM-as-Judge avg score: 4.2 β†’ 4.6\n", + "- False positive rate: 12% β†’ 8%\n", + "- User satisfaction: 3.8 β†’ 4.3\n", + "\n", + "### v1.1 (2024-02-20)\n", + "**Improvements:**\n", + "- Added security-specific guidelines (OWASP Top 10)\n", + "- Increased token limit to handle larger diffs\n", + "\n", + "**Metrics:**\n", + "- Caught 15% more security issues in testing\n", + "```\n", + "\n", + "#### When to Version Bump\n", + "- **Major (v1 β†’ v2)**: Breaking changes to output format, role changes\n", + "- **Minor (v1.0 β†’ v1.1)**: Added capabilities, new guidelines\n", + "- **Patch (v1.1.0 β†’ v1.1.1)**: Bug fixes, clarity improvements\n", + "\n", + "#### A/B Testing Prompts\n", + "```python\n", + "# Compare two prompt versions\n", + "results_v1 = run_reviews_with_template(\"code-review-v1.0.md\", test_prs)\n", + "results_v2 = run_reviews_with_template(\"code-review-v2.0.md\", test_prs)\n", + "\n", + "# Evaluate with LLM-as-Judge\n", + "scores_v1 = [judge(r) for r in results_v1]\n", + "scores_v2 = [judge(r) for r in results_v2]\n", + "\n", + "print(f\"v1.0 avg score: {mean(scores_v1)}\") # 3.8\n", + "print(f\"v2.0 avg score: {mean(scores_v2)}\") # 4.3\n", + "# Deploy v2.0!\n", + "```\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### πŸš€ Extension & Automation Ideas\n", + "\n", + "Ready to take it further? Here are real-world integration patterns:\n", + "\n", + "#### 1. CI/CD Pipeline Integration\n", + "```yaml\n", + "# .github/workflows/ai-code-review.yml\n", + "name: AI Code Review\n", + "\n", + "on: pull_request\n", + "\n", + "jobs:\n", + " ai-review:\n", + " runs-on: ubuntu-latest\n", + " steps:\n", + " - name: Get PR diff\n", + " run: gh pr diff ${{ github.event.pull_request.number }} > diff.txt\n", + " \n", + " - name: Run AI Review\n", + " run: python scripts/ai_review.py --diff diff.txt --template prompts/code-review-v2.0.md\n", + " \n", + " - name: Evaluate with Judge\n", + " run: python scripts/judge_review.py --review review.json\n", + " \n", + " - name: Post if High Quality (score β‰₯ 4.0)\n", + " if: steps.judge.outputs.score >= 4.0\n", + " run: gh pr comment ${{ github.event.pull_request.number }} --body-file review.md\n", + "```\n", + "\n", + "#### 2. IDE Integration (VS Code Extension)\n", + "```javascript\n", + "// AI Review on Save\n", + "vscode.workspace.onDidSaveTextDocument((doc) => {\n", + " const diff = getDiff(doc);\n", + " const template = loadTemplate('code-review-v2.0.md');\n", + " const review = callAI(template, diff);\n", + " const score = judgeReview(review);\n", + " \n", + " if (score >= 3.5) {\n", + " showInlineComments(review);\n", + " }\n", + "});\n", + "```\n", + "\n", + "#### 3. 
Slack Bot Integration\n", + "```python\n", + "@slack_app.command(\"/ai-review\")\n", + "def review_command(ack, body, say):\n", + " pr_url = body['text']\n", + " diff = github.get_pr_diff(pr_url)\n", + " \n", + " review = generate_review(diff, template='code-review-v2.0.md')\n", + " score = judge_review(review)\n", + " \n", + " if score >= 4.0:\n", + " say(f\"βœ… AI Review (score: {score}/5.0):\\n{review}\")\n", + " else:\n", + " say(f\"⚠️ Low confidence review (score: {score}/5.0). Human review recommended.\")\n", + "```\n", + "\n", + "#### 4. Pre-Commit Hook\n", + "```bash\n", + "#!/bin/bash\n", + "# .git/hooks/pre-commit\n", + "\n", + "# Get staged changes\n", + "git diff --cached > /tmp/staged.diff\n", + "\n", + "# Run AI review\n", + "python scripts/quick_review.py /tmp/staged.diff\n", + "\n", + "# Ask for confirmation if issues found\n", + "if [ $? -ne 0 ]; then\n", + " read -p \"AI found issues. Continue? (y/n) \" -n 1 -r\n", + " echo\n", + " if [[ ! $REPLY =~ ^[Yy]$ ]]; then\n", + " exit 1\n", + " fi\n", + "fi\n", + "```\n", + "\n", + "#### 5. Test Generation in Sprint Planning\n", + "```python\n", + "def generate_test_plan(feature_spec: str) -> TestPlan:\n", + " \"\"\"Generate test plan during sprint planning\"\"\"\n", + " \n", + " # Generate tests\n", + " test_plan = generate_tests(\n", + " requirements=feature_spec,\n", + " existing_tests=get_current_suite(),\n", + " template='test-generation-v1.0.md'\n", + " )\n", + " \n", + " # Validate coverage\n", + " judge_result = evaluate_test_plan(test_plan)\n", + " \n", + " if judge_result.score < 3.5:\n", + " # Regenerate with feedback\n", + " test_plan = generate_tests(\n", + " requirements=feature_spec,\n", + " existing_tests=get_current_suite(),\n", + " template='test-generation-v1.0.md',\n", + " previous_feedback=judge_result.feedback\n", + " )\n", + " \n", + " return test_plan\n", + "\n", + "# Use in planning:\n", + "story_points = estimate_from_test_count(test_plan.total_tests)\n", + "```\n", + "\n", + "#### 6. Continuous Monitoring Dashboard\n", + "```python\n", + "# Track prompt performance over time\n", + "dashboard = {\n", + " \"code_review_v2.0\": {\n", + " \"avg_judge_score\": 4.3,\n", + " \"usage_count\": 1247,\n", + " \"acceptance_rate\": 0.89,\n", + " \"avg_latency_ms\": 3200,\n", + " \"cost_per_review\": 0.04\n", + " },\n", + " \"test_generation_v1.0\": {\n", + " \"avg_judge_score\": 3.9,\n", + " \"usage_count\": 543,\n", + " \"acceptance_rate\": 0.76,\n", + " \"avg_latency_ms\": 4100,\n", + " \"cost_per_plan\": 0.08\n", + " }\n", + "}\n", + "```\n", + "\n", + "#### 🎯 Start Small, Scale Gradually\n", + "1. **Week 1**: Use templates manually in code reviews\n", + "2. **Week 2**: Add LLM-as-Judge validation\n", + "3. **Week 3**: Integrate into one repo's CI/CD\n", + "4. **Month 2**: Expand to team repos, collect metrics\n", + "5. **Month 3**: Optimize based on feedback, version templates\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "---\n", + "\n", + "## πŸ“ˆ Track Your Progress\n", + "\n", + "### Self-Assessment Questions\n", + "\n", + "After completing Module 3, reflect on these questions:\n", + "\n", + "1. **Can I design code review prompts with task decomposition?**\n", + " - Do you understand how to break reviews into steps (analyze β†’ assess β†’ suggest β†’ verdict)?\n", + " - Can you create role prompts that match domain expertise?\n", + "\n", + "2. 
**Can I create test generation templates that identify coverage gaps?**\n", + " - Can you design prompts that compare requirements vs existing tests?\n", + " - Do you know how to structure test specifications (purpose, preconditions, steps, expected)?\n", + "\n", + "3. **Can I build LLM-as-Judge rubrics with weighted criteria?**\n", + " - Can you define evaluation criteria appropriate for your domain?\n", + " - Do you know how to set acceptance thresholds and provide feedback?\n", + "\n", + "4. **Can I parameterize templates for reuse?**\n", + " - Do you know how to identify and mark template variables (`{{placeholder}}`)?\n", + " - Can you document template parameters for team use?\n", + "\n", + "5. **Can I refine templates based on feedback?**\n", + " - Do you understand version control for prompts?\n", + " - Can you A/B test different prompt versions?\n", + "\n", + "6. **Do I understand how to prepare prompts for CI/CD integration?**\n", + " - Can you design structured outputs that tools can parse?\n", + " - Do you know how to chain prompts (generate β†’ judge β†’ act)?\n", + "\n", + "### βœ… Check Off Your Learning Objectives\n", + "\n", + "Review the module objectives and check what you've mastered:\n", + "\n", + "- [ ] **Implement SDLC-focused prompts** for code review, test generation, and documentation\n", + "- [ ] **Design reusable templates** with parameterized sections for specific workflows\n", + "- [ ] **Evaluate prompt effectiveness** using LLM-as-Judge rubrics\n", + "- [ ] **Refine and adapt templates** based on feedback and edge cases\n", + "- [ ] **Apply best practices** for version control, parameterization, and quality assurance\n", + "\n", + "
    \n", + "πŸ’‘ Self-Check:

    \n", + "If you can confidently check off 4+ objectives, you're ready to apply these techniques in production!
    \n", + "If not, revisit the sections where you feel less confident and try the practice activities again.\n", + "
    \n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 🎊 Module 3 Complete!\n", + "\n", + "
    \n", + "

    πŸŽ‰ Congratulations!

    \n", + "

    \n", + " You've mastered production-ready prompt engineering for SDLC tasks!\n", + "

    \n", + "
    \n", + "\n", + "---\n", + "\n", + "### What You've Accomplished\n", + "\n", + "- βœ… **Applied prompt engineering tactics** to real SDLC scenarios\n", + "- βœ… **Built code review templates** with decomposition and chain-of-thought\n", + "- βœ… **Created test generation workflows** that identify coverage gaps\n", + "- βœ… **Implemented LLM-as-Judge** for quality assurance\n", + "- βœ… **Designed reusable templates** with parameterization\n", + "- βœ… **Learned best practices** for production deployment\n", + "\n", + "### πŸ”‘ Key Takeaways\n", + "\n", + "**1. Combine Tactics Strategically**\n", + "- Real-world prompts use multiple tactics together\n", + "- **Role + Structure + CoT + Judge = Robust workflow**\n", + "- Each tactic amplifies the others\n", + "\n", + "**2. Templates Enable Scale**\n", + "- Parameterized templates reduce prompt drift\n", + "- Version control ensures consistency over time\n", + "- Team collaboration becomes possible and repeatable\n", + "- Documentation turns templates into shared assets\n", + "\n", + "**3. Quality Assurance Matters**\n", + "- LLM-as-Judge catches issues early, before they reach production\n", + "- Rubrics encode team standards in executable form\n", + "- Iterative refinement improves quality over time\n", + "- Metrics provide objective feedback loops\n", + "\n", + "**4. Prepare for Production**\n", + "- Test templates thoroughly on diverse scenarios\n", + "- Document parameters and usage clearly\n", + "- Monitor performance (latency, cost, quality scores)\n", + "- Iterate based on real-world feedback\n", + "\n", + "---\n", + "\n", + "### πŸ“š What's Next?\n", + "\n", + "**Apply What You've Learned:**\n", + "\n", + "1. **Create templates for your team** \n", + " - Start with code review or test generation\n", + " - Adapt examples from this module to your domain\n", + " - Share with 2-3 teammates for feedback\n", + "\n", + "2. **Integrate into your workflow**\n", + " - Begin with manual use in daily work\n", + " - Add to CI/CD when templates are stable\n", + " - Consider IDE extensions or Slack bots\n", + "\n", + "3. **Collect feedback and iterate**\n", + " - Track what works and what doesn't\n", + " - Use LLM-as-Judge for objective metrics\n", + " - Version templates as they improve\n", + "\n", + "4. **Share with your team**\n", + " - Build a template library in your repo\n", + " - Document usage patterns and best practices\n", + " - Create a feedback channel for continuous improvement\n", + "\n", + "**Continue Learning:**\n", + "\n", + "- **Module 4**: Integration - Connect prompts to your development workflow (CI/CD, IDE, APIs)\n", + "- **Advanced Topics**: Multi-agent systems, prompt optimization, cost/latency tradeoffs\n", + "- **Community**: Share your templates and learn from others\n", + "\n", + "---\n", + "\n", + "
    \n", + "πŸš€ Ready for Real-World Impact:

    \n", + "You now have the skills to design production-ready prompt engineering workflows for software development. The templates you've learned aren't just exercisesβ€”they're patterns used by engineering teams at scale.

    \n", + "Go build something amazing! 🎯\n", + "
    \n", + "\n", + "---\n", + "\n", + "### πŸ™ Thank You!\n", + "\n", + "Thank you for completing Module 3! Your journey from learning individual tactics to building complete workflows demonstrates real growth in prompt engineering expertise.\n", + "\n", + "**Questions or feedback?** Open an issue in the repository or reach out to the maintainers. We'd love to hear how you're applying these techniques!\n", + "\n", + "**Next:** [Continue to Module 4: Integration](../module-04-integration/README.md)\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": ".venv", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.13.2" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/01-course/module-03-applications/solutions/activity-3.3-customization-solution.md b/01-course/module-03-applications/solutions/activity-3.3-customization-solution.md new file mode 100644 index 0000000..9017c98 --- /dev/null +++ b/01-course/module-03-applications/solutions/activity-3.3-customization-solution.md @@ -0,0 +1,354 @@ +# Activity 3.3 Solution: Template Customization Challenge + +## βœ… What Makes This Solution Follow Best Practices + +### 1. Performance-Specific Role and Context + +```python +"You are a Senior Performance Engineer specializing in database optimization." +``` + +```xml + +Performance Requirements: Must handle users with 1000+ posts efficiently + +``` + +**Why this matters:** +- Role matches the specific focus area (performance, not general review) +- Context includes scale requirements (1000+ posts) - critical for performance analysis +- Sets clear performance expectations and success criteria + +### 2. Domain-Specific Guidelines + +```xml + +1. Analyze algorithmic complexity (Big-O notation) +2. Identify N+1 query problems +3. Check for caching opportunities +4. Assess memory usage patterns +5. Recommend performance optimizations +6. Estimate performance impact with data size + +``` + +**Why this matters:** +- Not generic "check for issues" - uses performance engineering terminology +- Specific to performance domain (Big-O, N+1, caching) +- Each guideline is measurable and actionable +- Covers the full spectrum of performance concerns + +### 3. Performance-Specific Output Format + +```xml + + O(nΒ²) + O(n) + Use join query with single database call + +``` + +**Why this matters:** +- Includes complexity analysis (quantifiable improvement) +- Shows before/after optimization (clear value proposition) +- Quantifies improvement (not just "faster") +- Enables data-driven decisions + +### 4. Impact Estimation + +``` +6. Estimate performance impact with data size +``` + +**Why this matters:** +- Not just "it's slow" - provides actual numbers +- Example: "With 1000 posts, this takes 30s vs 0.3s optimized" +- Business impact is clear and measurable +- Helps prioritize optimization work + +### 5. 
Customization Template Pattern + +**The solution demonstrates** how to adapt the base template for different focus areas: +- Example shows performance focus +- Comments explain how to adapt for SRE or API design +- Demonstrates the pattern that applies to any domain + +## πŸ“Š Expected AI Output for Performance Review + +### Complexity Analysis + +``` +Current Analysis: +- O(n) for loop iterations over user.post_ids +- Each Post.query.get(post_id) is a separate DB query = O(1) per query +- Each Like.query.filter_by(post_id).count() is another query = O(1) per query +- Total: O(n) loop Γ— 2 queries per iteration = 2n database round trips + +Performance Impact: +- With network latency: ~50ms per query +- For 1000 posts: 50ms Γ— 2000 queries = 100 seconds! +- This is O(n) in queries, which is unacceptable for production +``` + +### N+1 Query Problem Identification + +``` +Line 4-5: Loop makes n queries for posts (N+1 problem #1) + for post_id in user.post_ids: + post = Post.query.get(post_id) # Query executed n times + +Line 6: Makes n additional queries for likes (N+1 problem #2) + like_count = Like.query.filter_by(post_id=post.id).count() # Query executed n times + +Total queries: 1 (user) + 1000 (posts) + 1000 (likes) = 2001 queries +Classic N+1 problem (actually N+N+1 in this case) +``` + +### Optimization Recommendation + +```python +def get_user_posts_with_likes_optimized(user_id): + """Optimized version using SQL joins""" + from sqlalchemy import func + + # Single query with join and aggregation + posts_with_likes = db.session.query( + Post, + func.count(Like.id).label('like_count') + ).join( + Like, Like.post_id == Post.id, isouter=True + ).filter( + Post.user_id == user_id + ).group_by( + Post.id + ).all() + + return posts_with_likes + +# Performance improvement: +# Before: 2001 queries Γ— 50ms = 100 seconds +# After: 1 query Γ— 50ms = 0.05 seconds +# Improvement: 2000x faster! +``` + +### Additional Performance Considerations + +**Caching Strategy:** +```python +# Add Redis caching for frequently accessed data +@cache.memoize(timeout=300) # Cache for 5 minutes +def get_user_posts_with_likes_cached(user_id): + return get_user_posts_with_likes_optimized(user_id) + +# Result: Subsequent requests served from cache in ~1ms +``` + +**Database Indexing:** +```sql +-- Add indexes to speed up joins +CREATE INDEX idx_posts_user_id ON posts(user_id); +CREATE INDEX idx_likes_post_id ON likes(post_id); + +-- Further reduces query time from 50ms to ~5ms +``` + +**Pagination:** +```python +# Don't load all 1000 posts at once +def get_user_posts_paginated(user_id, page=1, per_page=20): + return get_user_posts_with_likes_optimized(user_id)\ + .limit(per_page)\ + .offset((page-1) * per_page) + +# Load 20 posts at a time instead of 1000 +# Initial load: 5ms instead of 50ms +``` + +## πŸ”„ Adaptation Patterns for Other Focus Areas + +### For SRE/Observability Review + +```python +# Template customization for SRE focus +ReviewTemplate( + role="Site Reliability Engineer specializing in observability", + + guidelines=[ + "Check for structured logging (not print statements)", + "Verify metrics/tracing instrumentation", + "Assess error handling and graceful degradation", + "Review retry logic and circuit breakers", + "Check for proper resource cleanup (connections, files)", + "Verify timeouts and deadlines are set" + ], + + output_format=""" + + 1-5 + Missing metrics + Issues found + Circuit breakers, retries, etc. 
+ Specific improvements + + """ +) +``` + +**Example SRE Findings:** +```markdown +❌ Line 7: Using print() instead of structured logger + Recommendation: Use logger.info() with structured fields + +❌ Line 12: No timeout on database query + Recommendation: Add timeout=5 to prevent hanging + +❌ No circuit breaker on external API call + Recommendation: Wrap with @circuit_breaker(failure_threshold=5) + +βœ… Good: Proper error handling with specific exceptions +``` + +### For API Design Review + +```python +# Template customization for API design +ReviewTemplate( + role="API Architect specializing in RESTful design", + + guidelines=[ + "Verify RESTful conventions (proper HTTP methods)", + "Check status codes are semantically correct", + "Assess API versioning strategy", + "Review error response structure", + "Check for backward compatibility", + "Verify request/response schemas", + "Assess pagination and filtering" + ], + + output_format=""" + + Issues with REST conventions + List of breaking changes + Version strategy evaluation + Error response quality + API improvements + + """ +) +``` + +**Example API Design Findings:** +```markdown +❌ Line 3: POST /api/user//profile for updates + Issue: Should use PUT or PATCH for updates, not POST + Recommendation: Change to PATCH /api/user//profile + +❌ Line 18: Returns 200 for errors + Issue: Should return 4xx/5xx status codes for errors + Recommendation: Return 400 for validation errors, 404 for not found + +❌ No API versioning in URL or headers + Issue: Can't evolve API without breaking clients + Recommendation: Add /v1/ to URL path: /api/v1/user//profile + +⚠️ Breaking change: Removed 'bio' field from response + Impact: Existing clients will fail + Recommendation: Deprecate gradually with version bump +``` + +### For Frontend/React Review + +```python +# Template customization for React performance +ReviewTemplate( + role="Senior Frontend Engineer specializing in React performance", + + guidelines=[ + "Check for unnecessary re-renders", + "Verify proper use of useCallback/useMemo", + "Identify missing key props in lists", + "Assess component code splitting opportunities", + "Check for useEffect dependency issues", + "Review state management patterns", + "Identify bundle size optimization opportunities" + ], + + output_format=""" + + Unnecessary re-renders identified + useEffect, useMemo, useCallback issues + 1-5 + Estimated bundle size impact + Specific React optimizations + + """ +) +``` + +## 🎯 Key Takeaway: Same Structure, Different Focus + +The power of template customization is that you keep the same **structure** but change the **domain-specific elements**: + +| Template Element | Base | Performance | SRE | API Design | +|------------------|------|-------------|-----|------------| +| **Role** | Senior Engineer | Performance Engineer | Site Reliability Engineer | API Architect | +| **Guidelines** | General best practices | Big-O, N+1, caching | Logging, metrics, resilience | REST conventions, versioning | +| **Output** | Generic issues | Complexity analysis | Observability gaps | API compliance | +| **Severity** | blocker/major/minor | Performance impact (seconds/ms) | Incident risk (P0-P4) | Breaking changes vs non-breaking | +| **Evidence** | Code snippets | Profiling data, query counts | Missing telemetry | HTTP violations | +| **Success Criteria** | Code quality | Latency targets (< 100ms) | SLOs (99.9% uptime) | API standards compliance | + +## πŸ’‘ When to Create Custom Templates + +Create domain-specific templates when: + +1. 
**Specialized Knowledge Required** + - Security, performance, accessibility, compliance + - Domain experts have specific checklists + +2. **Different Standards Apply** + - Mobile vs web performance targets + - Public API vs internal API standards + - Real-time systems vs batch processing + +3. **Regulatory Requirements** + - HIPAA compliance reviews + - GDPR privacy checks + - Financial services regulations + +4. **Tool Integration Needed** + - Output must feed into specific tools + - Different metrics tracked per domain + - Integration with domain-specific dashboards + +## πŸš€ Implementation Strategy + +### Step 1: Start with Base Template +Use the general code review template as foundation + +### Step 2: Identify Domain-Specific Elements +- What expertise does this domain require? +- What specific issues should be caught? +- What terminology is used? +- What metrics matter? + +### Step 3: Customize Role and Guidelines +- Change role to domain expert +- Replace generic guidelines with domain-specific ones +- Add domain terminology and frameworks + +### Step 4: Adapt Output Format +- Include domain-specific metadata +- Add measurements relevant to domain +- Structure for domain-specific tools + +### Step 5: Test and Refine +- Test on known issues in the domain +- Measure detection accuracy +- Gather feedback from domain experts +- Iterate based on real usage + +--- + +**Remember**: Customization doesn't mean starting from scratch. The core prompting techniques (role, structure, CoT, evidence requirements) remain the same. You're just changing the domain expertise and evaluation criteria. + diff --git a/01-course/module-03-applications/solutions/activity-3.4-judge-solution.md b/01-course/module-03-applications/solutions/activity-3.4-judge-solution.md new file mode 100644 index 0000000..7d9adb3 --- /dev/null +++ b/01-course/module-03-applications/solutions/activity-3.4-judge-solution.md @@ -0,0 +1,416 @@ +# Activity 3.4 Solution: Quality Evaluation with LLM-as-Judge + +## βœ… What Makes This Solution Follow Best Practices + +### 1. Intentionally Low-Quality Input + +```xml +Function could be improved +The code is not optimal +Make it better +``` + +**Why this matters:** +- Tests the judge's ability to catch poor quality outputs +- Demonstrates what NOT to accept in production +- Realistic: Some AI outputs will be vague or unhelpful +- Validates that the quality gate actually works + +### 2. Context-Specific Rubric + +```xml + +1. Specificity (40%): Are issues concrete with exact evidence? +2. Actionability (30%): Can developer immediately act? +3. Technical Accuracy (20%): Are issues technically sound? +4. Completeness (10%): Are major categories covered? + +``` + +**Why this matters:** +- Weights match your team's priorities (example shows CI/CD context) +- Different contexts need different weightings +- CI/CD needs high specificity (automated posting) +- Documentation review might weight Communication higher + +### 3. Explicit Scoring Scale + +``` +- 5: Excellent - Ready for production +- 4: Good - Minor improvements needed +- 3: Acceptable - Meets minimum bar +- 2: Poor - Significant issues, needs revision +- 1: Unacceptable - Reject and regenerate +``` + +**Why this matters:** +- Clear definitions prevent ambiguity and inconsistent scoring +- Maps to actionable decisions (not subjective "good/bad") +- Consistent across evaluations (same criteria every time) +- Enables metric tracking over time + +### 4. 
Actionable Thresholds + +``` +- ACCEPT (β‰₯4.0): Post to PR +- REVISE (2.5-3.9): Regenerate with specific guidance +- REJECT (<2.5): Discard, use different approach +``` + +**Why this matters:** +- Clear decision points enable automation +- Each outcome has a defined action +- Can be implemented as: `if score >= 4.0: post_to_pr()` +- Middle tier (REVISE) triggers improvement loop instead of failure + +### 5. Improvement Feedback Required + +```xml + +``` + +**Why this matters:** +- Not just score, but HOW to improve +- Enables iterative refinement +- Feeds back into prompt improvement process +- Creates a learning loop + +## πŸ“Š Expected Judge Evaluation + +### Detailed Scoring Breakdown + +#### Specificity Score: 1/5 ❌ + +**Rationale:** +- "Function could be improved" is completely vague +- "The code is not optimal" provides zero evidence +- No line numbers cited +- No specific issues identified +- No code snippets shown + +**Improvement Needed:** +``` +Replace with: "Line 2: Missing input validation allows negative discount_percent values" +Include: Exact line numbers, quoted code, specific issue description +``` + +#### Actionability Score: 1/5 ❌ + +**Rationale:** +- "Make it better" is not actionable +- Developer has no idea WHAT to change +- No code examples provided +- No specific steps given +- Cannot implement without additional research + +**Improvement Needed:** +``` +Replace with: "Add validation before calculation: +if not 0 <= discount_percent <= 100: + raise ValueError('Discount must be between 0 and 100%')" +Include: Exact code changes, why they're needed, examples +``` + +#### Technical Accuracy Score: 2/5 ⚠️ + +**Rationale:** +- "medium severity" seems reasonable but no justification provided +- Can't verify if assessment is technically correct +- No explanation of WHY it's medium vs high/low +- Might be inaccurate but impossible to tell + +**Improvement Needed:** +``` +Include reasoning: "Severity: major - Input validation missing can lead to +runtime errors (negative prices, divide by zero if used in calculations), +affecting production reliability" +``` + +#### Completeness Score: 2/5 ⚠️ + +**Rationale:** +- Only one vague issue found (likely incomplete) +- No security analysis (SQL injection? XSS?) +- No performance considerations +- No maintainability checks (type hints, docstrings) +- No correctness verification + +**Improvement Needed:** +``` +Systematic check required: +βœ“ Security issues +βœ“ Performance concerns +βœ“ Correctness bugs +βœ“ Maintainability improvements +βœ“ Error handling +``` + +### Weighted Score Calculation + +```python +# Calculate weighted score +specificity_score = 1 +actionability_score = 1 +technical_accuracy_score = 2 +completeness_score = 2 + +weighted_total = ( + (specificity_score * 0.40) + # 0.40 + (actionability_score * 0.30) + # 0.30 + (technical_accuracy_score * 0.20) + # 0.40 + (completeness_score * 0.10) # 0.20 +) +# Total: 1.3 +``` + +### Decision: REJECT (<2.5) 🚫 + +**Feedback for Improvement:** + +```markdown +This review fails to meet minimum quality standards and must be rejected. + +Critical Issues: +1. ❌ Vague descriptions lack specificity + - Current: "Function could be improved" + - Required: "Line 2: Function lacks input validation for discount_percent parameter" + +2. ❌ Recommendations not actionable + - Current: "Make it better" + - Required: "Add validation: if not 0 <= discount_percent <= 100: raise ValueError(...)" + +3. 
❌ No evidence provided + - Current: "The code is not optimal" + - Required: Quote exact code, cite line numbers, explain WHY it's an issue + +4. ❌ Incomplete coverage + - Only 1 issue found, likely missing critical problems + - Should cover: security, performance, correctness, maintainability + +Example of Acceptable Quality: + + major + Missing input validation allows invalid discount percentages + Line 2: Function accepts any numeric value for discount_percent. + Values > 100 result in negative prices. Values < 0 increase price. + Add validation before calculation: + + if not 0 <= discount_percent <= 100: + raise ValueError(f"Discount must be between 0-100%, got {discount_percent}%") + + This prevents invalid inputs and provides clear error messages. + + +DO NOT POST THIS REVIEW. Regenerate with specific, actionable feedback. +``` + +## πŸ’» Production Implementation + +### Complete Automated Quality Gate + +```python +from dataclasses import dataclass +from typing import Optional +import re + +@dataclass +class JudgeResult: + """Structured judge evaluation result""" + specificity_score: float + actionability_score: float + technical_accuracy_score: float + completeness_score: float + weighted_total: float + decision: str # ACCEPT, REVISE, REJECT + feedback: str + +def parse_judge_output(judge_response: str) -> JudgeResult: + """Parse judge XML/text response into structured result""" + # In production, use proper XML parsing + # This is simplified for demonstration + + # Extract scores using regex (production should use XML parser) + specificity = float(re.search(r'Specificity.*?(\d+)/5', judge_response).group(1)) + actionability = float(re.search(r'Actionability.*?(\d+)/5', judge_response).group(1)) + technical = float(re.search(r'Technical.*?(\d+)/5', judge_response).group(1)) + completeness = float(re.search(r'Completeness.*?(\d+)/5', judge_response).group(1)) + + # Calculate weighted score + weighted = ( + (specificity * 0.40) + + (actionability * 0.30) + + (technical * 0.20) + + (completeness * 0.10) + ) + + # Determine decision + if weighted >= 4.0: + decision = "ACCEPT" + elif weighted >= 2.5: + decision = "REVISE" + else: + decision = "REJECT" + + # Extract feedback + feedback = re.search(r'(.*?)', judge_response, re.DOTALL).group(1) + + return JudgeResult( + specificity_score=specificity, + actionability_score=actionability, + technical_accuracy_score=technical, + completeness_score=completeness, + weighted_total=weighted, + decision=decision, + feedback=feedback + ) + +def automated_quality_gate(ai_review: str, max_retries: int = 2) -> dict: + """ + Complete quality gate with retry logic. + + Returns: + dict with 'status', 'review', 'score', 'attempts' + """ + + for attempt in range(max_retries + 1): + # Evaluate with judge + judge_result = evaluate_with_judge(ai_review) + score = judge_result.weighted_total + + # Log attempt + log_metric(f"judge_score_attempt_{attempt}", score) + + if score >= 4.0: + # HIGH QUALITY - Accept and post + post_to_pr(ai_review) + log_metric("ai_review_accepted", score) + + return { + "status": "POSTED", + "review": ai_review, + "score": score, + "attempts": attempt + 1 + } + + elif score >= 2.5 and attempt < max_retries: + # MEDIUM QUALITY - Regenerate with feedback + print(f"⚠️ Score {score} < 4.0. 
Regenerating with feedback...") + + # Regenerate with specific improvements + ai_review = regenerate_with_feedback( + original=ai_review, + feedback=judge_result.feedback, + attempt=attempt + ) + + # Continue to next iteration + continue + + elif score < 2.5: + # LOW QUALITY - Reject and fallback to human + log_metric("ai_review_rejected", score) + flag_for_human_review(reason=judge_result.feedback) + + return { + "status": "HUMAN_REVIEW_REQUIRED", + "review": None, + "score": score, + "attempts": attempt + 1 + } + + else: + # MAX RETRIES REACHED + log_metric("ai_review_max_retries", score) + flag_for_human_review(reason="Max retries exceeded") + + return { + "status": "NEEDS_HUMAN", + "review": ai_review, + "score": score, + "attempts": attempt + 1 + } + + return {"status": "FAILED", "review": None, "score": 0, "attempts": max_retries + 1} + + +# Example usage in CI/CD +def ci_cd_review_workflow(pr_diff: str): + """Complete CI/CD workflow with quality gate""" + + # Step 1: Generate review + ai_review = generate_code_review(pr_diff) + + # Step 2: Quality gate with retry + result = automated_quality_gate(ai_review, max_retries=2) + + # Step 3: Handle outcome + if result['status'] == 'POSTED': + print(f"βœ… AI review posted (score: {result['score']}, attempts: {result['attempts']})") + + elif result['status'] == 'NEEDS_HUMAN': + print(f"⚠️ AI review quality insufficient. Human review requested.") + send_slack_notification( + channel="#code-reviews", + message=f"PR requires human review. AI score: {result['score']}" + ) + + elif result['status'] == 'HUMAN_REVIEW_REQUIRED': + print(f"❌ AI review rejected. Flagging for human review.") + add_pr_label("needs-human-review") +``` + +### Monitoring Dashboard + +```python +def generate_quality_dashboard(): + """Track LLM-as-Judge metrics over time""" + + metrics = { + "acceptance_rate": count_accepted() / count_total(), + "average_score": mean(all_scores()), + "score_distribution": histogram(all_scores()), + "retry_rate": count_retries() / count_total(), + "human_fallback_rate": count_human_reviews() / count_total(), + + # By score component + "specificity_trend": timeseries("specificity"), + "actionability_trend": timeseries("actionability"), + + # Efficiency metrics + "avg_attempts_per_review": mean(attempts_per_review()), + "cost_per_accepted_review": calculate_cost(), + } + + return metrics + +# Example output: +# Acceptance Rate: 76% (↑ from 68% last week) +# Average Score: 4.1 / 5.0 +# Score Distribution: [5β˜…: 23%, 4β˜…: 53%, 3β˜…: 18%, 2β˜…: 5%, 1β˜…: 1%] +# Human Fallback: 6% (target < 10%) +# Avg Cost: $0.08 per accepted review +``` + +## 🎯 Key Takeaway + +LLM-as-Judge isn't just scoring - it's a **complete quality assurance system**: + +1. **Automated QA** - Catches poor outputs before they reach users +2. **Feedback Loop** - Provides actionable improvement guidance +3. **Decision Automation** - Enables if/else logic: accept/revise/reject +4. **Continuous Improvement** - Metrics guide prompt refinement +5. **Cost Control** - Prevents wasting human time on low-quality AI outputs +6. 
**Trust Building** - Teams gain confidence in AI-assisted workflows + +### Success Metrics to Track + +- **Acceptance Rate**: % of AI outputs that pass quality gate (target: >70%) +- **Average Score**: Mean judge score (target: >4.0) +- **Human Fallback Rate**: % requiring human review (target: <10%) +- **Cost per Review**: Including retries (target: <$0.10) +- **Score Trend**: Improving over time as prompts are refined + +--- + +**Remember**: The judge is only as good as its rubric. Invest time in defining evaluation criteria that match your team's standards and priorities. + diff --git a/session_1_introduction_and_basics.ipynb b/session_1_introduction_and_basics.ipynb new file mode 100644 index 0000000..cc6e79d --- /dev/null +++ b/session_1_introduction_and_basics.ipynb @@ -0,0 +1,1048 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Introduction\n", + "\n", + "**You have an AI model. It seems to work. But how do you actually know?​**\n", + "\n", + "### Common Pain Points:\n", + "- **Retrieval fails silently**: Gets irrelevant chunks but you don't notice\n", + "- **Context gets lost**: Important info split across chunks disappears \n", + "- **Hallucination persists**: LLM makes up facts even with good sources\n", + "- **Quality varies wildly**: Same question, different quality answers each time\n", + "- **Manual checking doesn't scale**: Can't manually verify thousands of responses\n", + "\n", + "### The $10M Question:\n", + "*\"How do you evaluate AI systems that generate nuanced, contextual responses at scale?\"*" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Why Evaluations Are Critical (Real-World Impact)\n", + "\n", + "print(\"🚨 HIGH-STAKES AI DEPLOYMENT REALITY\")\n", + "print(\"=\" * 45)\n", + "\n", + "deployment_stats = {\n", + " \"Customer Service Bots\": \"Handle millions of conversations daily\",\n", + " \"Content Moderation\": \"Process billions of social media posts\", \n", + " \"Medical AI\": \"Assist in patient diagnosis and treatment\",\n", + " \"Legal AI\": \"Evaluate document relevance in court cases\",\n", + " \"Financial AI\": \"Determine loan approvals and credit decisions\",\n", + " \"Educational AI\": \"Grade student work and provide feedback\"\n", + "}\n", + "\n", + "print(\"Current AI Scale:\")\n", + "for system, impact in deployment_stats.items():\n", + " print(f\"β€’ {system}: {impact}\")\n", + "\n", + "print(\"\\nπŸ’° COST OF POOR EVALUATION:\")\n", + "print(\"-\" * 30)\n", + "\n", + "failure_costs = {\n", + " \"Customer Churn\": \"23% abandon AI tools after bad experience\",\n", + " \"Support Costs\": \"Poor AI increases human tickets by 40%\", \n", + " \"Brand Damage\": \"AI failures become viral social content\",\n", + " \"Legal Liability\": \"Biased systems face discrimination lawsuits\",\n", + " \"Regulatory Risk\": \"Can't prove compliance without measurement\"\n", + "}\n", + "\n", + "for cost_type, impact in failure_costs.items():\n", + " print(f\"β€’ {cost_type}: {impact}\")\n", + "\n", + "print(\"\\n🎯 THE BOTTOM LINE:\")\n", + "print(\"Without proper evaluation, AI systems fail silently at scale.\")\n", + "print(\"LLM judges provide the solution - but only if built correctly!\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## πŸ“Š Traditional Evaluation Methods\n", + "\n", + "### Human Evaluation Methods:\n", + "- **Expert assessment**: Manual rating but $5-50 per evaluation\n", + "- **Weeks to scale**: Gold 
standard quality, impossible timeline\n", + "- **Subjective bias**: Different evaluators, different standards\n", + "- **Can't handle volume**: Thousands of outputs daily\n", + "\n", + "### Reference-Based Automated Metrics:\n", + "- **Exact Match**: Perfect matches only, zero tolerance\n", + "- **F1 Score**: Token overlap, misses meaning\n", + "- **BLEU**: Translation metric, ignores factual accuracy\n", + "- **ROUGE**: Content recall, can't detect hallucinations\n", + "\n", + "### Critical Limitations:\n", + "- **Rigid scoring**: Correct rephrases score poorly\n", + "- **Missing hallucination detection**: Can't spot made-up facts\n", + "- **Context blind**: Ignores document grounding\n", + "- **Too slow**: Can't monitor production systems real-time" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Exact Match (EM)\n", + "\n", + "Definition: Exact Match is a binary metric that determines if a generated text is perfectly identical to a reference text. It is a very strict measure, returning 1 (true) only if every character matches, including case, punctuation, and spacing; otherwise, it returns 0 (false). It has \"zero tolerance\" for any deviation.\n", + "\n", + "\n", + "Formula:\n", + "$$ EM(R, C) = \\begin{cases} 1 & \\text{if } R = C \\ 0 & \\text{if } R \\neq C \\end{cases} $$\n", + "Where:\n", + "\n", + "\n", + "$R$ is the Reference text.\n", + "$C$ is the Candidate (generated) text.\n", + "\n", + "Exact Match is straightforward to implement manually or can be found in some NLP toolkits." + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Reference: 'The capital of France is Paris.'\n", + "Candidate 1: 'The capital of France is Paris.' -> EM Score: 1\n", + "Candidate 2: 'The capital of France is paris.' -> EM Score: 0\n", + "Candidate 3: 'Paris is the capital of France.' -> EM Score: 0\n" + ] + } + ], + "source": [ + "def exact_match(reference: str, candidate: str) -> int:\n", + " \"\"\"\n", + " Calculates the Exact Match score between a reference and a candidate string.\n", + " Returns 1 if they are identical, 0 otherwise.\n", + " \"\"\"\n", + " return 1 if reference == candidate else 0\n", + "\n", + "# Working Example\n", + "reference_em = \"The capital of France is Paris.\"\n", + "\n", + "candidate_em_1 = \"The capital of France is Paris.\"\n", + "candidate_em_2 = \"The capital of France is paris.\"\n", + "candidate_em_3 = \"Paris is the capital of France.\"\n", + "\n", + "print(f\"Reference: '{reference_em}'\")\n", + "print(f\"Candidate 1: '{candidate_em_1}' -> EM Score: {exact_match(reference_em, candidate_em_1)}\")\n", + "print(f\"Candidate 2: '{candidate_em_2}' -> EM Score: {exact_match(reference_em, candidate_em_2)}\")\n", + "print(f\"Candidate 3: '{candidate_em_3}' -> EM Score: {exact_match(reference_em, candidate_em_3)}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### F1 Score\n", + "\n", + "Definition: The F1 Score is the harmonic mean of Precision and Recall. In the context of NLP text generation evaluation (especially for tasks like question answering where token overlap is important), it measures the overlap between the references in the generated text and the reference text.\n", + "\n", + "\n", + "Precision: Measures how many of the references in the generated text are also present in the reference text. 
It answers: \"Of all the references I generated, how many were correct?\"\n", + "Recall: Measures how many of the references in the reference text were captured by the generated text. It answers: \"Of all the correct references, how many did I generate?\"\n", + "\n", + "Formulas:\n", + "Let:\n", + "\n", + "\n", + "$TP$ (True Positives) = Number of references common to both the candidate and reference texts.\n", + "$FP$ (False Positives) = Number of references in the candidate text but not in the reference text.\n", + "$FN$ (False Negatives) = Number of references in the reference text but not in the candidate text.\n", + "\n", + "$$ Precision = \\frac{TP}{TP + FP} = \\frac{\\text{Number of matching references}}{\\text{Total references in candidate}} $$\n", + "$$ Recall = \\frac{TP}{TP + FN} = \\frac{\\text{Number of matching references}}{\\text{Total references in reference}} $$\n", + "$$ F1 = 2 \\times \\frac{Precision \\times Recall}{Precision + Recall} $$\n", + "\n", + "For token-level F1, we often use sklearn.metrics.f1_score after converting strings to sets of references." + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "Reference references: ['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog.']\n", + "Candidate references: ['a', 'quick', 'fox', 'jumps', 'over', 'a', 'dog.']\n", + "F1 Score (token-level): 0.625\n" + ] + } + ], + "source": [ + "from collections import Counter\n", + "\n", + "def calculate_f1_score_references(reference_references: list, candidate_references: list) -> float:\n", + " \"\"\"\n", + " Calculates the token-level F1 score between a reference and a candidate list of references.\n", + " \"\"\"\n", + " common = Counter(reference_references) & Counter(candidate_references)\n", + " num_common = sum(common.values())\n", + "\n", + " if num_common == 0:\n", + " return 0.0\n", + "\n", + " precision = num_common / len(candidate_references)\n", + " recall = num_common / len(reference_references)\n", + "\n", + " f1 = (2 * precision * recall) / (precision + recall)\n", + " return f1\n", + "\n", + "# Working Example\n", + "reference_f1 = \"The quick brown fox jumps over the lazy dog.\"\n", + "candidate_f1 = \"A quick fox jumps over a dog.\"\n", + "\n", + "# Tokenize the sentences (simple split for demonstration)\n", + "reference_references_f1 = reference_f1.lower().split()\n", + "candidate_references_f1 = candidate_f1.lower().split()\n", + "\n", + "print(f\"\\nReference references: {reference_references_f1}\")\n", + "print(f\"Candidate references: {candidate_references_f1}\")\n", + "print(f\"F1 Score (token-level): {calculate_f1_score_references(reference_references_f1, candidate_references_f1):.3f}\")\n", + "\n", + "# Using sklearn for comparison (requires converting to binary labels, which is less direct for this specific use case)\n", + "# For direct token overlap, the custom function above is more illustrative.\n", + "# If using sklearn, it's typically for classification where each token is a class." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## What is LLM as a Judge?\n", + "\n", + "Large Language Models (LLMs) as judges represent a paradigm where we leverage the reasoning capabilities of LLMs to evaluate, score, and assess various types of content, conversations, or decisions.\n", + "\n", + "### Key Characteristics:\n", + "- **Automated Evaluation**: Replace human evaluators in specific contexts\n", + "- **Consistent Scoring**: Provide standardized assessment criteria\n", + "- **Scalable Assessment**: Handle large volumes of evaluation tasks\n", + "- **Multi-dimensional Analysis**: Evaluate multiple criteria simultaneously\n", + "\n", + "### Why LLM Judges Changed Everything:\n", + "- **Semantic Understanding**: Recognizes paraphrasing and meaning beyond keywords\n", + "- **Scalable Human-like Judgment**: Thousands of evaluations in minutes vs weeks\n", + "- **Reference-free Evaluation**: Can assess faithfulness without ground truth\n", + "- **Contextual Assessment**: Considers domain expertise and user intent" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "βœ… ChatOllama initialized with llama3.1:8b model\n" + ] + } + ], + "source": [ + "# Setup and imports\n", + "import os\n", + "import json\n", + "import pandas as pd\n", + "from typing import Dict, List, Any, Optional\n", + "from langchain_ollama import ChatOllama\n", + "from langchain_core.messages import HumanMessage, SystemMessage\n", + "\n", + "# Initialize LLM\n", + "try:\n", + " llm = ChatOllama(model=\"llama3.1:8b\", temperature=0)\n", + " llm.invoke(\"Hello World!\")\n", + " print(\"βœ… ChatOllama initialized with llama3.1:8b model\")\n", + "except Exception as e:\n", + " print(f\"❌ Failed to initialize ChatOllama: {e}\")\n", + " print(\"Please make sure Ollama is installed and running with llama3.1 model\")" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Sample Text: \n", + "The quick brown fox jumps over the lazy dog. This sentence contains all letters of the alphabet.\n", + "It's commonly used for testing fonts and keyboards.\n", + "\n", + "\n", + "Evaluation Criteria:\n", + "- Clarity: How clear and understandable is the text?\n", + "- Informativeness: How much useful information does it provide?\n", + "- Engagement: How engaging is the content for readers?\n" + ] + } + ], + "source": [ + "# Simple example of LLM evaluation concept\n", + "sample_text = \"\"\"\n", + "The quick brown fox jumps over the lazy dog. 
This sentence contains all letters of the alphabet.\n", + "It's commonly used for testing fonts and keyboards.\n", + "\"\"\"\n", + "\n", + "evaluation_criteria = {\n", + " \"clarity\": \"How clear and understandable is the text?\",\n", + " \"informativeness\": \"How much useful information does it provide?\",\n", + " \"engagement\": \"How engaging is the content for readers?\"\n", + "}\n", + "\n", + "print(\"Sample Text:\", sample_text)\n", + "print(\"\\nEvaluation Criteria:\")\n", + "for criterion, description in evaluation_criteria.items():\n", + " print(f\"- {criterion.title()}: {description}\")" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "πŸ€– LLM EVALUATION RESULTS\n", + "\n", + "🎯 Evaluating: Clarity\n", + "----------------------------------------\n", + "LLM Response:\n", + "Score: 9/10\n", + "Reasoning: The text is clear and easy to understand, but it assumes some prior knowledge about the purpose of the sentence. A reader who has never heard of this sentence before might not fully grasp its significance or why it's used for testing fonts and keyboards. However, the language itself is simple and straightforward, making it accessible to a wide range of readers.\n", + "\n", + "🎯 Evaluating: Informativeness\n", + "----------------------------------------\n", + "LLM Response:\n", + "Score: 6/10\n", + "Reasoning: The text provides some useful information about the sentence, specifically its use for testing fonts and keyboards. However, it doesn't provide much depth or context beyond that. It also assumes prior knowledge of why this particular sentence is significant (i.e., containing all letters of the alphabet), which limits its usefulness to readers who are already familiar with this fact.\n", + "\n", + "🎯 Evaluating: Engagement\n", + "----------------------------------------\n", + "LLM Response:\n", + "Score: 2/10\n", + "Reasoning: The content is dry and lacks any narrative or emotional appeal. It's primarily informative, stating a fact about the sentence's composition and its practical application. 
While it may be interesting for those who appreciate linguistic trivia, it's unlikely to engage readers on an emotional level or spark their curiosity in a significant way.\n" + ] + } + ], + "source": [ + "print(\"πŸ€– LLM EVALUATION RESULTS\")\n", + "# Now let's use the LLM to evaluate the text against each criterion\n", + "for criterion, description in evaluation_criteria.items():\n", + " print(f\"\\n🎯 Evaluating: {criterion.title()}\")\n", + " print(\"-\" * 40)\n", + " \n", + " # Create evaluation prompt\n", + " evaluation_prompt = f\"\"\"\n", + "Please evaluate the following text based on this criterion: {description}\n", + "\n", + "Text to evaluate: {sample_text.strip()}\n", + "\n", + "Provide a score from 1-10 and a brief explanation of your reasoning.\n", + "Format your response as:\n", + "Score: X/10\n", + "Reasoning: [Your explanation]\n", + "\"\"\"\n", + " \n", + " # Get LLM evaluation\n", + " try:\n", + " response = llm.invoke(evaluation_prompt)\n", + " print(f\"LLM Response:\\n{response.content}\")\n", + " except Exception as e:\n", + " print(f\"❌ Error getting evaluation: {e}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Applications Across Domains\n", + "\n", + "### Legal and Judicial Applications\n", + "- **Document Relevance Scoring**: Assess relevance of legal documents to cases\n", + "- **Case Law Analysis**: Evaluate similarity between legal precedents\n", + "- **Judicial Decision Support**: Assist in evidence evaluation and consistency checking\n", + "\n", + "### Content Quality Evaluation\n", + "- **Academic Paper Review**: Automated initial screening of research papers\n", + "- **Content Moderation**: Scale content review for platforms\n", + "- **Customer Service Quality**: Evaluate support interactions\n", + "\n", + "### Conversation Assessment\n", + "- **Chatbot Performance**: Evaluate AI assistant responses\n", + "- **Human-likeness Detection**: Assess naturalness of generated conversations\n", + "- **Training Data Quality**: Validate synthetic conversation datasets" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "πŸ›οΈ LEGAL DOMAIN EXAMPLE: Document Relevance Scoring\n", + "============================================================\n", + "Case: Personal injury lawsuit: slip and fall at grocery store\n", + "Document: Store surveillance footage showing wet floor conditions on day of incident\n", + "\n", + "πŸ€– LLM Evaluation:\n", + "I would rate the document's relevance to the legal case as a 9 out of 10.\n", + "\n", + "The document is directly related to the incident in question, providing visual evidence of the store's condition at the time of the slip and fall. The footage can be used to:\n", + "\n", + "* Support or refute claims made by the plaintiff about the cause of the accident\n", + "* Show that the store was aware of the wet floor conditions and failed to take adequate measures to address them\n", + "* Demonstrate the extent of the hazard posed by the wet floor\n", + "\n", + "The only reason I wouldn't give it a perfect 10 is that, without more context or analysis, we can't be certain what specific details the footage shows. 
However, in general, store surveillance footage is highly relevant and probative evidence in slip and fall cases like this one.\n", + "\n", + "πŸ’¬ CONVERSATION EXAMPLE: Chatbot Response Quality\n", + "============================================================\n", + "Customer: I ordered a laptop 3 days ago but haven't received shipping confirmation. Can you help?\n", + "Chatbot: Orders usually ship within 5-7 business days. Please wait longer.\n", + "\n", + "πŸ€– LLM Evaluation:\n", + "I would rate the helpfulness of this chatbot response as a 2 out of 10.\n", + "\n", + "The response is unhelpful for several reasons:\n", + "\n", + "* It doesn't acknowledge the customer's concern or frustration about not receiving shipping confirmation.\n", + "* The answer is too vague, stating only that orders \"usually\" ship within 5-7 business days. This doesn't provide any specific information about the status of this particular order.\n", + "* The response essentially tells the customer to wait longer without offering any additional assistance or next steps.\n", + "\n", + "To improve this response, I would suggest the following:\n", + "\n", + "1. Acknowledge the customer's concern: \"Sorry to hear that you haven't received shipping confirmation yet.\"\n", + "2. Provide a more specific answer: \"I've checked on your order and it was shipped out yesterday. You should receive an email with tracking information shortly.\"\n", + "3. Offer additional assistance or next steps: \"If you don't receive the email within the next 24 hours, please let me know and I'll be happy to look into this further.\"\n", + "\n", + "Here's an example of a rewritten response that addresses these issues:\n", + "\n", + "\"Sorry to hear that you haven't received shipping confirmation yet. I've checked on your order and it was shipped out yesterday. You should receive an email with tracking information shortly. If you don't receive the email within the next 24 hours, please let me know and I'll be happy to look into this further.\"\n", + "\n", + "πŸ’‘ Key Takeaways:\n", + "- Legal: Helps prioritize case materials\n", + "- Chatbot: Improves customer service quality\n", + "- All domains need clear evaluation criteria!\n" + ] + } + ], + "source": [ + "# Domain Examples for LLM as Judge\n", + "\n", + "print(\"πŸ›οΈ LEGAL DOMAIN EXAMPLE: Document Relevance Scoring\")\n", + "print(\"=\" * 60)\n", + "\n", + "# Case scenario\n", + "legal_case = \"Personal injury lawsuit: slip and fall at grocery store\"\n", + "sample_document = \"Store surveillance footage showing wet floor conditions on day of incident\"\n", + "\n", + "print(f\"Case: {legal_case}\")\n", + "print(f\"Document: {sample_document}\")\n", + "\n", + "# LLM evaluation\n", + "legal_prompt = f\"\"\"\n", + "Rate this document's relevance to the legal case (1-10 scale):\n", + "\n", + "Case: {legal_case}\n", + "Document: {sample_document}\n", + "\n", + "Provide: Score (1-10) and brief reasoning.\n", + "\"\"\"\n", + "\n", + "try:\n", + " legal_result = llm.invoke(legal_prompt)\n", + " print(f\"\\nπŸ€– LLM Evaluation:\\n{legal_result.content}\")\n", + "except Exception as e:\n", + " print(f\"Error: {e}\")\n", + "\n", + "print(\"\\nπŸ’¬ CONVERSATION EXAMPLE: Chatbot Response Quality\")\n", + "print(\"=\" * 60)\n", + "\n", + "# Customer service scenario\n", + "customer_query = \"I ordered a laptop 3 days ago but haven't received shipping confirmation. Can you help?\"\n", + "chatbot_response = \"Orders usually ship within 5-7 business days. 
Please wait longer.\"\n", + "\n", + "print(f\"Customer: {customer_query}\")\n", + "print(f\"Chatbot: {chatbot_response}\")\n", + "\n", + "# LLM evaluation\n", + "chatbot_prompt = f\"\"\"\n", + "Evaluate this chatbot response for customer service quality:\n", + "\n", + "Customer Query: {customer_query}\n", + "Chatbot Response: {chatbot_response}\n", + "\n", + "Rate helpfulness (1-10) and suggest improvements.\n", + "\"\"\"\n", + "\n", + "try:\n", + " chatbot_result = llm.invoke(chatbot_prompt)\n", + " print(f\"\\nπŸ€– LLM Evaluation:\\n{chatbot_result.content}\")\n", + "except Exception as e:\n", + " print(f\"Error: {e}\")\n", + "\n", + "print(\"\\nπŸ’‘ Key Takeaways:\")\n", + "print(\"- Legal: Helps prioritize case materials\")\n", + "print(\"- Chatbot: Improves customer service quality\")\n", + "print(\"- All domains need clear evaluation criteria!\")" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "πŸ“ CONTENT QUALITY EXAMPLE: Academic Paper Review\n", + "============================================================\n", + "Abstract to review:\n", + "\n", + "We surveyed 10 students about social media and mood. Students using social media \n", + "more than 3 hours daily sometimes felt sad. Therefore, social media is bad for \n", + "all teenagers and should be banned.\n", + "\n", + "\n", + "πŸ€– LLM Review:\n", + "I'd rate this abstract a 2 out of 10 in terms of quality.\n", + "\n", + "Here are the main problems I've identified:\n", + "\n", + "1. **Sample size**: The sample size is extremely small, consisting of only 10 students. This is not sufficient to draw any meaningful conclusions about social media use and mood among teenagers.\n", + "2. **Lack of control group**: There is no comparison group or control condition in this study. How do we know that the students who used social media more than 3 hours a day would have felt sad if they hadn't used social media? A control group would help to establish causality.\n", + "3. **Correlation vs. causation**: The abstract implies that using social media causes sadness, but it's possible that there are other factors at play (e.g., students who use social media more may be more prone to depression or anxiety). Correlational studies like this one can't establish cause-and-effect relationships.\n", + "4. **Overly broad conclusion**: The abstract concludes that \"social media is bad for all teenagers and should be banned.\" This is an overly simplistic and sweeping statement, especially given the small sample size and lack of control group.\n", + "5. **Lack of statistical analysis**: There's no mention of any statistical tests or analyses used to examine the relationship between social media use and mood. This makes it difficult to evaluate the validity of the findings.\n", + "\n", + "Overall, this abstract raises more questions than answers, and its conclusions are likely based on a flawed methodology.\n" + ] + } + ], + "source": [ + "print(\"\\nπŸ“ CONTENT QUALITY EXAMPLE: Academic Paper Review\")\n", + "print(\"=\" * 60)\n", + "\n", + "# Sample abstract with obvious flaws\n", + "paper_abstract = \"\"\"\n", + "We surveyed 10 students about social media and mood. Students using social media \n", + "more than 3 hours daily sometimes felt sad. 
Therefore, social media is bad for \n", + "all teenagers and should be banned.\n", + "\"\"\"\n", + "\n", + "print(f\"Abstract to review:\\n{paper_abstract}\")\n", + "\n", + "# LLM evaluation\n", + "academic_prompt = f\"\"\"\n", + "Review this academic abstract for quality issues:\n", + "\n", + "Abstract: {paper_abstract}\n", + "\n", + "Rate (1-10) and identify main problems with methodology, sample size, or conclusions.\n", + "\"\"\"\n", + "\n", + "try:\n", + " academic_result = llm.invoke(academic_prompt)\n", + " print(f\"\\nπŸ€– LLM Review:\\n{academic_result.content}\")\n", + "except Exception as e:\n", + " print(f\"Error: {e}\")\n" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "πŸ’¬ CONVERSATION EXAMPLE: Chatbot Response Quality\n", + "============================================================\n", + "Customer: I ordered a laptop 3 days ago but haven't received shipping confirmation. Can you help?\n", + "Chatbot: Orders usually ship within 5-7 business days. Please wait longer.\n", + "\n", + "πŸ€– LLM Evaluation:\n", + "I would rate the helpfulness of this chatbot response as a 2 out of 10.\n", + "\n", + "The response is unhelpful for several reasons:\n", + "\n", + "* It doesn't acknowledge the customer's concern or frustration about not receiving shipping confirmation.\n", + "* The answer is too vague, stating only that orders \"usually\" ship within 5-7 business days. This doesn't provide any specific information about the status of this particular order.\n", + "* The response essentially tells the customer to wait longer without offering any additional assistance or next steps.\n", + "\n", + "To improve this response, I would suggest the following:\n", + "\n", + "1. Acknowledge the customer's concern: \"Sorry to hear that you haven't received shipping confirmation yet.\"\n", + "2. Provide a more specific answer: \"I've checked on your order and it was shipped out yesterday. You should receive an email with tracking information shortly.\"\n", + "3. Offer additional assistance or next steps: \"If you don't receive the email within the next 24 hours, please let me know and I'll be happy to look into this further.\"\n", + "\n", + "Here's an example of a rewritten response that addresses these issues:\n", + "\n", + "\"Sorry to hear that you haven't received shipping confirmation yet. I've checked on your order and it was shipped out yesterday. You should receive an email with tracking information shortly. If you don't receive the email within the next 24 hours, please let me know and I'll be happy to look into this further.\"\n", + "\n", + "============================================================\n", + "πŸ’‘ Key Takeaways:\n", + "- Legal: Helps prioritize case materials\n", + "- Academic: Catches obvious methodology flaws\n", + "- Chatbot: Improves customer service quality\n", + "- All domains need clear evaluation criteria!\n" + ] + } + ], + "source": [ + "print(\"\\nπŸ’¬ CONVERSATION EXAMPLE: Chatbot Response Quality\")\n", + "print(\"=\" * 60)\n", + "\n", + "# Customer service scenario\n", + "customer_query = \"I ordered a laptop 3 days ago but haven't received shipping confirmation. Can you help?\"\n", + "chatbot_response = \"Orders usually ship within 5-7 business days. 
Please wait longer.\"\n", + "\n", + "print(f\"Customer: {customer_query}\")\n", + "print(f\"Chatbot: {chatbot_response}\")\n", + "\n", + "# LLM evaluation\n", + "chatbot_prompt = f\"\"\"\n", + "Evaluate this chatbot response for customer service quality:\n", + "\n", + "Customer Query: {customer_query}\n", + "Chatbot Response: {chatbot_response}\n", + "\n", + "Rate helpfulness (1-10) and suggest improvements.\n", + "\"\"\"\n", + "\n", + "try:\n", + " chatbot_result = llm.invoke(chatbot_prompt)\n", + " print(f\"\\nπŸ€– LLM Evaluation:\\n{chatbot_result.content}\")\n", + "except Exception as e:\n", + " print(f\"Error: {e}\")\n", + "\n", + "print(\"\\n\" + \"=\" * 60)\n", + "print(\"πŸ’‘ Key Takeaways:\")\n", + "print(\"- Legal: Helps prioritize case materials\")\n", + "print(\"- Academic: Catches obvious methodology flaws\") \n", + "print(\"- Chatbot: Improves customer service quality\")\n", + "print(\"- All domains need clear evaluation criteria!\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## The Journey Most People Take (And Why It's Problematic)\n", + "\n", + "Most people start with LLM judging the same way:\n", + "1. **\"Just ask if it's correct\"** - Seems obvious, what could go wrong?\n", + "2. **\"Ask for true/false\"** - More structured, feels better\n", + "3. **\"Give it a score\"** - Numbers feel objective and scientific\n", + "4. **\"Compare two options\"** - Let the LLM pick the better one\n", + "\n", + "**Spoiler**: Each approach has serious hidden flaws that most people never discover.\n", + "\n", + "Let's experience this journey together, starting with the most naive approach..." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Naive Approach 1: \"Just Tell Me If This Answer Is Correct\"\n", + "\n", + "This is how everyone starts. It seems so simple and obvious:\n", + "- Give the LLM a question and an answer\n", + "- Ask \"Is this answer correct?\"\n", + "- Trust the yes/no response\n", + "\n", + "### What Could Possibly Go Wrong?\n", + "Let's find out using carefully chosen examples..." + ] + }, + { + "cell_type": "code", + "execution_count": 16, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "### Naive Approach 1: 'Just Tell Me If This Answer Is Correct'\n", + "This method fails spectacularly when the LLM is presented with information that is both plausible and incorrect. The LLM may lack the internal process to critically verify a statement that it is presented with as fact, especially if the answer is short and lacks context.\n", + "\n", + "User Question: What programming language should beginners learn first?\n", + "Model Answer: Python is an excellent choice for beginners because it has clean, readable syntax and a gentle learning curve. Most computer science courses and coding bootcamps start with Python.\n", + "LLM Judge Prompt:\n", + "\n", + "Is the given answer correct? Only answer with Yes or No.\n", + "Question: 'What programming language should beginners learn first?'\n", + "Answer: 'Python is an excellent choice for beginners because it has clean, readable syntax and a gentle learning curve. 
Most computer science courses and coding bootcamps start with Python.'\n", + "\n", + "LLM Response: content='Yes' additional_kwargs={} response_metadata={'model': 'llama3.1:8b', 'created_at': '2025-09-17T05:43:50.950278Z', 'done': True, 'done_reason': 'stop', 'total_duration': 1307772584, 'load_duration': 886591917, 'prompt_eval_count': 71, 'prompt_eval_duration': 399741583, 'eval_count': 2, 'eval_duration': 20468667, 'model_name': 'llama3.1:8b'} id='run--4c59488c-c22e-4ea3-b3cb-059760c9bf78-0' usage_metadata={'input_tokens': 71, 'output_tokens': 2, 'total_tokens': 73}\n", + "\n", + "**What goes wrong:** The LLM will often agree even if the answer is subjective and other equally valid answers exist. It can confidently state 'yes' to an opinion presented as fact, making the output seem reliable when it is not universally true.\n", + "**Hidden Flaw:** The LLM's confidence is not a reliable indicator of correctness when the question itself is subjective. Confident, research-backed language can trick the LLM into thinking advice is factual rather than contextual.\n" + ] + } + ], + "source": [ + "print(\"\\n### Naive Approach 1: 'Just Tell Me If This Answer Is Correct'\")\n", + "print(\"This method fails spectacularly when the LLM is presented with information that is both plausible and incorrect. The LLM may lack the internal process to critically verify a statement that it is presented with as fact, especially if the answer is short and lacks context.\")\n", + "\n", + "# User's example details\n", + "user_question_1 = \"What programming language should beginners learn first?\"\n", + "model_answer_1 = \"Python is an excellent choice for beginners because it has clean, readable syntax and a gentle learning curve. Most computer science courses and coding bootcamps start with Python.\"\n", + "llm_judge_prompt_1 = f\"\"\"\n", + "Is the given answer correct? Only answer with Yes or No.\n", + "Question: '{user_question_1}'\n", + "Answer: '{model_answer_1}'\n", + "\"\"\"\n", + "\n", + "print(f\"\\nUser Question: {user_question_1}\")\n", + "print(f\"Model Answer: {model_answer_1}\")\n", + "print(f\"LLM Judge Prompt:\\n{llm_judge_prompt_1}\")\n", + "\n", + "# Simulating LLM response based on the problem description\n", + "# In a real scenario, llm.invoke(llm_judge_prompt_2) would be called.\n", + "# The problem description states: \"LLM likely says YES because answer sounds authoritative and mentions 'most courses' - mistaking common practice for universal truth.\"\n", + "response_1_content = llm.invoke(llm_judge_prompt_1)\n", + "response_1 = type('obj', (object,), {'content': response_1_content})() # Mocking the response object\n", + "print(f\"LLM Response: {response_1.content}\")\n", + "\n", + "print(\"\\n**What goes wrong:** The LLM will often agree even if the answer is subjective and other equally valid answers exist. It can confidently state 'yes' to an opinion presented as fact, making the output seem reliable when it is not universally true.\")\n", + "print(\"**Hidden Flaw:** The LLM's confidence is not a reliable indicator of correctness when the question itself is subjective. 
Confident, research-backed language can trick the LLM into thinking advice is factual rather than contextual.\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Naive Approach 2: True/False Classification\n", + "\n", + "After discovering issues with simple correctness, people often move to true/false evaluation:\n", + "- Seems more structured and binary\n", + "- Feels more \"scientific\" than yes/no\n", + "- But loses important nuance..." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "### Naive Approach 3: 'True/False with Nuanced Claims'\n", + "This method fails when the LLM is asked to evaluate a nuanced claim as a simple true or false statement, even with context. The LLM may lack the ability to acknowledge the complexities, exceptions, or varying degrees of truth within a statement, leading to an oversimplified 'True' or 'False' response that misses critical subtleties.\n", + "\n", + "User Question: Is the following statement true or false given the context?\n", + "Statement being evaluated (Model Answer): Exercise is good for mental health\n", + "Context provided: Regular moderate exercise has been shown in numerous studies to reduce symptoms of depression and anxiety.\n", + "LLM Judge Prompt:\n", + "\n", + "Is the following statement true or false given the context? Return only True or False.\n", + "Statement: 'Exercise is good for mental health'\n", + "Context: 'Regular moderate exercise has been shown in numerous studies to reduce symptoms of depression and anxiety.'\n", + "\n", + "LLM Response: content='True.' additional_kwargs={} response_metadata={'model': 'llama3.1:8b', 'created_at': '2025-09-17T05:44:03.94595Z', 'done': True, 'done_reason': 'stop', 'total_duration': 432792000, 'load_duration': 74965458, 'prompt_eval_count': 57, 'prompt_eval_duration': 314256000, 'eval_count': 3, 'eval_duration': 42852458, 'model_name': 'llama3.1:8b'} id='run--aaa08250-df9a-4fde-af32-16148e2dca89-0' usage_metadata={'input_tokens': 57, 'output_tokens': 3, 'total_tokens': 60}\n", + "\n", + "**What goes wrong:** The LLM will likely respond 'True' because the statement is broadly accepted, despite the significant nuances and exceptions. It oversimplifies a complex topic into a binary answer.\n", + "**Hidden Flaw:** The LLM's binary 'True/False' judgment fails to capture the conditional nature or limitations of the claim. It struggles with statements that are 'mostly true' but not universally or unconditionally true, especially when the context provided supports the general truth without elaborating on exceptions.\n" + ] + } + ], + "source": [ + "print(\"\\n### Naive Approach 2: 'True/False with Nuanced Claims'\")\n", + "print(\"This method fails when the LLM is asked to evaluate a nuanced claim as a simple true or false statement, even with context. 
The LLM may lack the ability to acknowledge the complexities, exceptions, or varying degrees of truth within a statement, leading to an oversimplified 'True' or 'False' response that misses critical subtleties.\")\n", + "\n", + "# User's example details\n", + "user_question_2 = \"Is the following statement true or false given the context?\"\n", + "model_answer_2 = \"Exercise is good for mental health\" # This is the statement being evaluated\n", + "context_2 = \"Regular moderate exercise has been shown in numerous studies to reduce symptoms of depression and anxiety.\"\n", + "llm_judge_prompt_2 = f\"\"\"\n", + "Is the following statement true or false given the context? Return only True or False.\n", + "Statement: '{model_answer_2}'\n", + "Context: '{context_2}'\n", + "\"\"\"\n", + "\n", + "print(f\"\\nUser Question: {user_question_2}\")\n", + "print(f\"Statement being evaluated (Model Answer): {model_answer_2}\")\n", + "print(f\"Context provided: {context_2}\")\n", + "print(f\"LLM Judge Prompt:\\n{llm_judge_prompt_2}\")\n", + "\n", + "# Simulating LLM response based on the problem description\n", + "# The problem description states: \"Generally true, but ignores individual variation, severity of conditions, and that exercise alone isn't sufficient for serious mental health issues\"\n", + "# An LLM would likely respond 'True' because the statement is generally accepted as true, overlooking the nuances.\n", + "response_2_content = llm.invoke(llm_judge_prompt_2)\n", + "response_2 = type('obj', (object,), {'content': response_2_content})() # Mocking the response object\n", + "print(f\"LLM Response: {response_2.content}\")\n", + "\n", + "print(\"\\n**What goes wrong:** The LLM will likely respond 'True' because the statement is broadly accepted, despite the significant nuances and exceptions. It oversimplifies a complex topic into a binary answer.\")\n", + "print(\"**Hidden Flaw:** The LLM's binary 'True/False' judgment fails to capture the conditional nature or limitations of the claim. It struggles with statements that are 'mostly true' but not universally or unconditionally true, especially when the context provided supports the general truth without elaborating on exceptions.\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Naive Approach 3: Direct Scoring\n", + "\n", + "When true/false feels too limiting, people turn to scoring:\n", + "- \"Numbers are objective!\"\n", + "- \"1-10 scale feels scientific\"\n", + "- But without clear criteria, scores become arbitrary..." + ] + }, + { + "cell_type": "code", + "execution_count": 18, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "### Naive Approach 3: Direct Scoring\n", + "A score can be completely arbitrary without a clear definition of what each number represents. A partially correct answer or even a hallucinated answer might receive a surprisingly high score if the LLM is designed to be helpful rather than strictly accurate.\n", + "\n", + "User Question: Explain how photosynthesis works\n", + "Model Answer: Plants use sunlight to make food. Chlorophyll in leaves absorbs light and converts carbon dioxide and water into glucose and oxygen.\n", + "LLM Judge Prompt:\n", + "\n", + "Rate this answer from 1-10 for quality. Only provide the number. Explain your reasoning.\n", + "Question: 'Explain how photosynthesis works'\n", + "Answer: 'Plants use sunlight to make food. 
Chlorophyll in leaves absorbs light and converts carbon dioxide and water into glucose and oxygen.'\n", + "\n", + "LLM Response (Example Run 1): content='8\\n\\nThis answer provides a clear and concise explanation of the basic process of photosynthesis, including the key components involved (chlorophyll, sunlight, CO2, H2O, glucose, and O2). However, it lacks detail and does not mention the overall equation for photosynthesis or the role of light-dependent reactions.' additional_kwargs={} response_metadata={'model': 'llama3.1:8b', 'created_at': '2025-09-17T05:45:08.176698Z', 'done': True, 'done_reason': 'stop', 'total_duration': 1674546042, 'load_duration': 76024167, 'prompt_eval_count': 69, 'prompt_eval_duration': 308406792, 'eval_count': 67, 'eval_duration': 1289491833, 'model_name': 'llama3.1:8b'} id='run--f9fb30a3-81a1-465d-9535-53a242ab2cbf-0' usage_metadata={'input_tokens': 69, 'output_tokens': 67, 'total_tokens': 136}\n", + "\n", + "**Hidden Flaw:** The LLM lacks a transparent and consistently applied internal rubric for 'quality'. Without explicit criteria provided in the prompt, its scoring becomes arbitrary, reflecting internal stochasticity rather than a stable evaluation of the answer's merit. This means 'no clear criteria means arbitrary scoring'.\n" + ] + } + ], + "source": [ + "print(\"\\n### Naive Approach 3: Direct Scoring\")\n", + "print(\"A score can be completely arbitrary without a clear definition of what each number represents. A partially correct answer or even a hallucinated answer might receive a surprisingly high score if the LLM is designed to be helpful rather than strictly accurate.\")\n", + "# User's example details\n", + "user_question_3 = \"Explain how photosynthesis works\"\n", + "model_answer_3 = \"Plants use sunlight to make food. Chlorophyll in leaves absorbs light and converts carbon dioxide and water into glucose and oxygen.\"\n", + "llm_judge_prompt_3 = f\"\"\"\n", + "Rate this answer from 1-10 for quality. Only provide the number. Explain your reasoning.\n", + "Question: '{user_question_3}'\n", + "Answer: '{model_answer_3}'\n", + "\"\"\"\n", + "\n", + "print(f\"\\nUser Question: {user_question_3}\")\n", + "print(f\"Model Answer: {model_answer_3}\")\n", + "print(f\"LLM Judge Prompt:\\n{llm_judge_prompt_3}\")\n", + "\n", + "# Simulating LLM response based on the problem description\n", + "# The problem description explicitly shows inconsistency by running multiple times.\n", + "# We'll simulate one run and then explain the inconsistency in the analysis.\n", + "# Let's pick a plausible score for a single run.\n", + "response_3= llm.invoke(llm_judge_prompt_3)\n", + "response_3 = type('obj', (object,), {'content': response_3 })() # Mocking the response object\n", + "print(f\"LLM Response (Example Run 1): {response_3.content}\")\n", + "\n", + "# To demonstrate inconsistency as per the original example, we would run it multiple times:\n", + "# For illustrative purposes in the explanation, we can mention a range.\n", + "# Example scores from multiple runs could be 8, 7, 9.\n", + "# print(f\"LLM Response (Example Run 2): 7\")\n", + "# print(f\"LLM Response (Example Run 3): 9\")\n", + "\n", + "print(\"\\n**Hidden Flaw:** The LLM lacks a transparent and consistently applied internal rubric for 'quality'. Without explicit criteria provided in the prompt, its scoring becomes arbitrary, reflecting internal stochasticity rather than a stable evaluation of the answer's merit. 
This means 'no clear criteria means arbitrary scoring'.\")\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Naive Approach 4: Compare two options\n", + "\n", + "When direct scoring proves too arbitrary and inconsistent, people often pivot to comparing options:\n", + "\n", + "\n", + "\"Let the LLM pick the better one!\"\n", + "\"It's how humans often evaluate choices, so it must be good.\"\n", + "But this method is still susceptible to subtle biases that can skew the results" + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "### Naive Approach 4: Compare two options\n", + "This method, popular in preference fine-tuning, is still susceptible to several biases, including position bias and verbosity bias. An LLM might prefer a response based on its length or its position in the prompt rather than its actual quality.\n", + "\n", + "User Question: Describe the city of New York.\n", + "Model Answer A: New York City is a global hub for finance, culture, and media. It is the most populous city in the United States and is home to iconic landmarks like the Statue of Liberty and Times Square. Its diverse neighborhoods and bustling atmosphere make it a unique and dynamic place to visit.\n", + "Model Answer B: New York City is a place with lots of tall buildings and famous spots. It has many different people from all over the world. It can be a very busy and exciting place to be.\n", + "LLM Judge Prompt (A first):\n", + "Which of the two following answers is better?\n", + "Answer A: 'New York City is a global hub for finance, culture, and media. It is the most populous city in the United States and is home to iconic landmarks like the Statue of Liberty and Times Square. Its diverse neighborhoods and bustling atmosphere make it a unique and dynamic place to visit.'\n", + "Answer B: 'New York City is a place with lots of tall buildings and famous spots. It has many different people from all over the world. It can be a very busy and exciting place to be.'\n", + "LLM Response (A first): After analyzing both answers, I would say that **Answer A** is better for several reasons:\n", + "\n", + "1. **Specificity**: Answer A provides specific examples of New York City's significance (finance, culture, media) and iconic landmarks (Statue of Liberty, Times Square), making it more informative and engaging.\n", + "2. **Vivid language**: The use of words like \"global hub,\" \"iconic,\" \"diverse neighborhoods,\" and \"bustling atmosphere\" creates a richer and more immersive experience for the reader.\n", + "3. **Clear structure**: Answer A follows a logical structure, starting with an overview of New York City's significance and then highlighting its notable features.\n", + "4. **Engagement**: The description in Answer A is likely to pique the interest of readers who are interested in travel or learning about new places.\n", + "\n", + "In contrast, Answer B:\n", + "\n", + "1. **Lacks specificity**: It uses vague terms like \"lots of tall buildings\" and \"famous spots,\" which don't provide much insight into what makes New York City unique.\n", + "2. **Uses simple language**: While simplicity can be beneficial for some audiences, it doesn't add depth or interest to the description in this case.\n", + "3. 
**Fails to engage**: The description is more generic and less likely to capture the reader's attention.\n", + "\n", + "Overall, Answer A provides a more detailed, engaging, and informative description of New York City, making it the better choice.\n", + "\n", + "**Hidden flaw (Verbosity Bias, as per your description):** The judge will almost certainly favor the longer, more verbose Answer A, associating greater length with higher quality, even though Answer B is not necessarily wrong or inadequate for certain contexts. The LLM's response often reflects this preference by citing more detail or sophisticated language.\n", + "\n", + "LLM Judge Prompt (B first):\n", + "Which of the two following answers is better?\n", + "Answer A: 'New York City is a place with lots of tall buildings and famous spots. It has many different people from all over the world. It can be a very busy and exciting place to be.'\n", + "Answer B: 'New York City is a global hub for finance, culture, and media. It is the most populous city in the United States and is home to iconic landmarks like the Statue of Liberty and Times Square. Its diverse neighborhoods and bustling atmosphere make it a unique and dynamic place to visit.'\n", + "LLM Response (B first): Answer B is significantly better than Answer A for several reasons:\n", + "\n", + "1. **Specificity**: Answer B provides specific details about New York City, such as its status as a global hub for finance, culture, and media, which gives the reader a clearer understanding of what the city has to offer.\n", + "2. **Accuracy**: The information in Answer B is more accurate and up-to-date compared to Answer A. For example, it correctly identifies New York City as the most populous city in the United States (although this may change over time).\n", + "3. **Organization**: Answer B presents its information in a clear and organized manner, making it easier for the reader to follow.\n", + "4. **Style**: The language used in Answer B is more formal and polished than Answer A, which makes it more suitable for academic or professional writing.\n", + "\n", + "Answer A, on the other hand, is more general and lacks specific details about New York City. It also uses vague terms like \"lots of tall buildings\" and \"famous spots,\" which don't provide much insight into what makes the city unique.\n", + "\n", + "Overall, Answer B is a better choice because it provides more accurate, specific, and organized information that gives the reader a clearer understanding of New York City's characteristics.\n", + "\n", + "**Hidden flaw (Position Bias, as per your description):** If the order of the answers were swapped (as demonstrated in the second prompt), there is a chance the LLM would favor the new 'Answer A' (which is now the less verbose one), demonstrating an unconscious preference for the first item presented. This bias is particularly problematic for instances where the answers are of similar quality, as the LLM's preference can be swayed by the arbitrary ordering.\n" + ] + } + ], + "source": [ + "print(\"\\n### Naive Approach 4: Compare two options\")\n", + "print(\"This method, popular in preference fine-tuning, is still susceptible to several biases, including position bias and verbosity bias. 
An LLM might prefer a response based on its length or its position in the prompt rather than its actual quality.\")\n", + "\n", + "# User's example details\n", + "user_question_4 = \"Describe the city of New York.\"\n", + "model_answer_A_4 = \"New York City is a global hub for finance, culture, and media. It is the most populous city in the United States and is home to iconic landmarks like the Statue of Liberty and Times Square. Its diverse neighborhoods and bustling atmosphere make it a unique and dynamic place to visit.\"\n", + "model_answer_B_4 = \"New York City is a place with lots of tall buildings and famous spots. It has many different people from all over the world. It can be a very busy and exciting place to be.\"\n", + "\n", + "# Prompt for Verbosity Bias (A is more verbose)\n", + "llm_judge_prompt_4_verbosity = f\"\"\"Which of the two following answers is better?\n", + "Answer A: '{model_answer_A_4}'\n", + "Answer B: '{model_answer_B_4}'\"\"\"\n", + "\n", + "print(f\"\\nUser Question: {user_question_4}\")\n", + "print(f\"Model Answer A: {model_answer_A_4}\")\n", + "print(f\"Model Answer B: {model_answer_B_4}\")\n", + "print(f\"LLM Judge Prompt (A first):\\n{llm_judge_prompt_4_verbosity}\")\n", + "\n", + "response_4_verbosity = llm.invoke(llm_judge_prompt_4_verbosity)\n", + "print(f\"LLM Response (A first): {response_4_verbosity.content}\")\n", + "\n", + "print(\"\\n**Hidden flaw (Verbosity Bias, as per your description):** The judge will almost certainly favor the longer, more verbose Answer A, associating greater length with higher quality, even though Answer B is not necessarily wrong or inadequate for certain contexts. The LLM's response often reflects this preference by citing more detail or sophisticated language.\")\n", + "\n", + "# Prompt for Position Bias (swapping A and B)\n", + "llm_judge_prompt_4_position = f\"\"\"Which of the two following answers is better?\n", + "Answer A: '{model_answer_B_4}'\n", + "Answer B: '{model_answer_A_4}'\"\"\" # Swapped order\n", + "\n", + "print(f\"\\nLLM Judge Prompt (B first):\\n{llm_judge_prompt_4_position}\")\n", + "\n", + "response_4_position = llm.invoke(llm_judge_prompt_4_position)\n", + "print(f\"LLM Response (B first): {response_4_position.content}\")\n", + "\n", + "print(\"\\n**Hidden flaw (Position Bias, as per your description):** If the order of the answers were swapped (as demonstrated in the second prompt), there is a chance the LLM would favor the new 'Answer A' (which is now the less verbose one), demonstrating an unconscious preference for the first item presented. This bias is particularly problematic for instances where the answers are of similar quality, as the LLM's preference can be swayed by the arbitrary ordering.\")\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Summary: The Path Forward\n", + "\n", + "### What We've Learned:\n", + "\n", + "**The Problem:**\n", + "1. **Traditional evaluation methods** don't work for modern AI systems\n", + "2. **AI models fail silently** without proper evaluation\n", + "3. **Naive LLM judging approaches** have hidden flaws\n", + "\n", + "**The Solution:**\n", + "1. **LLM as Judge** provides scalable, understanding\n", + "2. **Proper implementation** requires systematic approaches\n", + "3. 
**Success measurement** needs concrete metrics and bias detection\n", + "\n", + "### Next Steps:\n", + "In the following sessions, we'll build sophisticated solutions:\n", + "- **Session 2**: Progressive improvements and structured approaches\n", + "- **Session 3**: Production-ready systems with bias detection\n", + "- **Final Challenge**: Building comprehensive evaluation pipelines\n", + "\n", + "---\n", + "\n", + "**You now understand both the promise and perils of LLM as Judge systems. Ready to build better solutions?**" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "qna_gen", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.12.2" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} From b3118a48cc8bd92676df6bf6ab8fecb44c8c460a Mon Sep 17 00:00:00 2001 From: snehangshuk Date: Fri, 17 Oct 2025 18:17:18 +0530 Subject: [PATCH 6/6] Remove the introductory session notebook on AI evaluation, streamlining the content for better focus on subsequent modules. This deletion eliminates outdated material and enhances the overall structure of the course. --- session_1_introduction_and_basics.ipynb | 1048 ----------------------- 1 file changed, 1048 deletions(-) delete mode 100644 session_1_introduction_and_basics.ipynb diff --git a/session_1_introduction_and_basics.ipynb b/session_1_introduction_and_basics.ipynb deleted file mode 100644 index cc6e79d..0000000 --- a/session_1_introduction_and_basics.ipynb +++ /dev/null @@ -1,1048 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Introduction\n", - "\n", - "**You have an AI model. It seems to work. 
But how do you actually know?​**\n", - "\n", - "### Common Pain Points:\n", - "- **Retrieval fails silently**: Gets irrelevant chunks but you don't notice\n", - "- **Context gets lost**: Important info split across chunks disappears \n", - "- **Hallucination persists**: LLM makes up facts even with good sources\n", - "- **Quality varies wildly**: Same question, different quality answers each time\n", - "- **Manual checking doesn't scale**: Can't manually verify thousands of responses\n", - "\n", - "### The $10M Question:\n", - "*\"How do you evaluate AI systems that generate nuanced, contextual responses at scale?\"*" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Why Evaluations Are Critical (Real-World Impact)\n", - "\n", - "print(\"🚨 HIGH-STAKES AI DEPLOYMENT REALITY\")\n", - "print(\"=\" * 45)\n", - "\n", - "deployment_stats = {\n", - " \"Customer Service Bots\": \"Handle millions of conversations daily\",\n", - " \"Content Moderation\": \"Process billions of social media posts\", \n", - " \"Medical AI\": \"Assist in patient diagnosis and treatment\",\n", - " \"Legal AI\": \"Evaluate document relevance in court cases\",\n", - " \"Financial AI\": \"Determine loan approvals and credit decisions\",\n", - " \"Educational AI\": \"Grade student work and provide feedback\"\n", - "}\n", - "\n", - "print(\"Current AI Scale:\")\n", - "for system, impact in deployment_stats.items():\n", - " print(f\"β€’ {system}: {impact}\")\n", - "\n", - "print(\"\\nπŸ’° COST OF POOR EVALUATION:\")\n", - "print(\"-\" * 30)\n", - "\n", - "failure_costs = {\n", - " \"Customer Churn\": \"23% abandon AI tools after bad experience\",\n", - " \"Support Costs\": \"Poor AI increases human tickets by 40%\", \n", - " \"Brand Damage\": \"AI failures become viral social content\",\n", - " \"Legal Liability\": \"Biased systems face discrimination lawsuits\",\n", - " \"Regulatory Risk\": \"Can't prove compliance without measurement\"\n", - "}\n", - "\n", - "for cost_type, impact in failure_costs.items():\n", - " print(f\"β€’ {cost_type}: {impact}\")\n", - "\n", - "print(\"\\n🎯 THE BOTTOM LINE:\")\n", - "print(\"Without proper evaluation, AI systems fail silently at scale.\")\n", - "print(\"LLM judges provide the solution - but only if built correctly!\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## πŸ“Š Traditional Evaluation Methods\n", - "\n", - "### Human Evaluation Methods:\n", - "- **Expert assessment**: Manual rating but $5-50 per evaluation\n", - "- **Weeks to scale**: Gold standard quality, impossible timeline\n", - "- **Subjective bias**: Different evaluators, different standards\n", - "- **Can't handle volume**: Thousands of outputs daily\n", - "\n", - "### Reference-Based Automated Metrics:\n", - "- **Exact Match**: Perfect matches only, zero tolerance\n", - "- **F1 Score**: Token overlap, misses meaning\n", - "- **BLEU**: Translation metric, ignores factual accuracy\n", - "- **ROUGE**: Content recall, can't detect hallucinations\n", - "\n", - "### Critical Limitations:\n", - "- **Rigid scoring**: Correct rephrases score poorly\n", - "- **Missing hallucination detection**: Can't spot made-up facts\n", - "- **Context blind**: Ignores document grounding\n", - "- **Too slow**: Can't monitor production systems real-time" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Exact Match (EM)\n", - "\n", - "Definition: Exact Match is a binary metric that determines if a generated text is 
perfectly identical to a reference text. It is a very strict measure, returning 1 (true) only if every character matches, including case, punctuation, and spacing; otherwise, it returns 0 (false). It has \"zero tolerance\" for any deviation.\n", - "\n", - "\n", - "Formula:\n", - "$$ EM(R, C) = \\begin{cases} 1 & \\text{if } R = C \\ 0 & \\text{if } R \\neq C \\end{cases} $$\n", - "Where:\n", - "\n", - "\n", - "$R$ is the Reference text.\n", - "$C$ is the Candidate (generated) text.\n", - "\n", - "Exact Match is straightforward to implement manually or can be found in some NLP toolkits." - ] - }, - { - "cell_type": "code", - "execution_count": 1, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Reference: 'The capital of France is Paris.'\n", - "Candidate 1: 'The capital of France is Paris.' -> EM Score: 1\n", - "Candidate 2: 'The capital of France is paris.' -> EM Score: 0\n", - "Candidate 3: 'Paris is the capital of France.' -> EM Score: 0\n" - ] - } - ], - "source": [ - "def exact_match(reference: str, candidate: str) -> int:\n", - " \"\"\"\n", - " Calculates the Exact Match score between a reference and a candidate string.\n", - " Returns 1 if they are identical, 0 otherwise.\n", - " \"\"\"\n", - " return 1 if reference == candidate else 0\n", - "\n", - "# Working Example\n", - "reference_em = \"The capital of France is Paris.\"\n", - "\n", - "candidate_em_1 = \"The capital of France is Paris.\"\n", - "candidate_em_2 = \"The capital of France is paris.\"\n", - "candidate_em_3 = \"Paris is the capital of France.\"\n", - "\n", - "print(f\"Reference: '{reference_em}'\")\n", - "print(f\"Candidate 1: '{candidate_em_1}' -> EM Score: {exact_match(reference_em, candidate_em_1)}\")\n", - "print(f\"Candidate 2: '{candidate_em_2}' -> EM Score: {exact_match(reference_em, candidate_em_2)}\")\n", - "print(f\"Candidate 3: '{candidate_em_3}' -> EM Score: {exact_match(reference_em, candidate_em_3)}\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### F1 Score\n", - "\n", - "Definition: The F1 Score is the harmonic mean of Precision and Recall. In the context of NLP text generation evaluation (especially for tasks like question answering where token overlap is important), it measures the overlap between the references in the generated text and the reference text.\n", - "\n", - "\n", - "Precision: Measures how many of the references in the generated text are also present in the reference text. It answers: \"Of all the references I generated, how many were correct?\"\n", - "Recall: Measures how many of the references in the reference text were captured by the generated text. 
It answers: \"Of all the correct references, how many did I generate?\"\n", - "\n", - "Formulas:\n", - "Let:\n", - "\n", - "\n", - "$TP$ (True Positives) = Number of references common to both the candidate and reference texts.\n", - "$FP$ (False Positives) = Number of references in the candidate text but not in the reference text.\n", - "$FN$ (False Negatives) = Number of references in the reference text but not in the candidate text.\n", - "\n", - "$$ Precision = \\frac{TP}{TP + FP} = \\frac{\\text{Number of matching references}}{\\text{Total references in candidate}} $$\n", - "$$ Recall = \\frac{TP}{TP + FN} = \\frac{\\text{Number of matching references}}{\\text{Total references in reference}} $$\n", - "$$ F1 = 2 \\times \\frac{Precision \\times Recall}{Precision + Recall} $$\n", - "\n", - "For token-level F1, we often use sklearn.metrics.f1_score after converting strings to sets of references." - ] - }, - { - "cell_type": "code", - "execution_count": 2, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "\n", - "Reference references: ['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog.']\n", - "Candidate references: ['a', 'quick', 'fox', 'jumps', 'over', 'a', 'dog.']\n", - "F1 Score (token-level): 0.625\n" - ] - } - ], - "source": [ - "from collections import Counter\n", - "\n", - "def calculate_f1_score_references(reference_references: list, candidate_references: list) -> float:\n", - " \"\"\"\n", - " Calculates the token-level F1 score between a reference and a candidate list of references.\n", - " \"\"\"\n", - " common = Counter(reference_references) & Counter(candidate_references)\n", - " num_common = sum(common.values())\n", - "\n", - " if num_common == 0:\n", - " return 0.0\n", - "\n", - " precision = num_common / len(candidate_references)\n", - " recall = num_common / len(reference_references)\n", - "\n", - " f1 = (2 * precision * recall) / (precision + recall)\n", - " return f1\n", - "\n", - "# Working Example\n", - "reference_f1 = \"The quick brown fox jumps over the lazy dog.\"\n", - "candidate_f1 = \"A quick fox jumps over a dog.\"\n", - "\n", - "# Tokenize the sentences (simple split for demonstration)\n", - "reference_references_f1 = reference_f1.lower().split()\n", - "candidate_references_f1 = candidate_f1.lower().split()\n", - "\n", - "print(f\"\\nReference references: {reference_references_f1}\")\n", - "print(f\"Candidate references: {candidate_references_f1}\")\n", - "print(f\"F1 Score (token-level): {calculate_f1_score_references(reference_references_f1, candidate_references_f1):.3f}\")\n", - "\n", - "# Using sklearn for comparison (requires converting to binary labels, which is less direct for this specific use case)\n", - "# For direct token overlap, the custom function above is more illustrative.\n", - "# If using sklearn, it's typically for classification where each token is a class." 
- ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## What is LLM as a Judge?\n", - "\n", - "Large Language Models (LLMs) as judges represent a paradigm where we leverage the reasoning capabilities of LLMs to evaluate, score, and assess various types of content, conversations, or decisions.\n", - "\n", - "### Key Characteristics:\n", - "- **Automated Evaluation**: Replace human evaluators in specific contexts\n", - "- **Consistent Scoring**: Provide standardized assessment criteria\n", - "- **Scalable Assessment**: Handle large volumes of evaluation tasks\n", - "- **Multi-dimensional Analysis**: Evaluate multiple criteria simultaneously\n", - "\n", - "### Why LLM Judges Changed Everything:\n", - "- **Semantic Understanding**: Recognizes paraphrasing and meaning beyond keywords\n", - "- **Scalable Human-like Judgment**: Thousands of evaluations in minutes vs weeks\n", - "- **Reference-free Evaluation**: Can assess faithfulness without ground truth\n", - "- **Contextual Assessment**: Considers domain expertise and user intent" - ] - }, - { - "cell_type": "code", - "execution_count": 3, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "βœ… ChatOllama initialized with llama3.1:8b model\n" - ] - } - ], - "source": [ - "# Setup and imports\n", - "import os\n", - "import json\n", - "import pandas as pd\n", - "from typing import Dict, List, Any, Optional\n", - "from langchain_ollama import ChatOllama\n", - "from langchain_core.messages import HumanMessage, SystemMessage\n", - "\n", - "# Initialize LLM\n", - "try:\n", - " llm = ChatOllama(model=\"llama3.1:8b\", temperature=0)\n", - " llm.invoke(\"Hello World!\")\n", - " print(\"βœ… ChatOllama initialized with llama3.1:8b model\")\n", - "except Exception as e:\n", - " print(f\"❌ Failed to initialize ChatOllama: {e}\")\n", - " print(\"Please make sure Ollama is installed and running with llama3.1 model\")" - ] - }, - { - "cell_type": "code", - "execution_count": 4, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Sample Text: \n", - "The quick brown fox jumps over the lazy dog. This sentence contains all letters of the alphabet.\n", - "It's commonly used for testing fonts and keyboards.\n", - "\n", - "\n", - "Evaluation Criteria:\n", - "- Clarity: How clear and understandable is the text?\n", - "- Informativeness: How much useful information does it provide?\n", - "- Engagement: How engaging is the content for readers?\n" - ] - } - ], - "source": [ - "# Simple example of LLM evaluation concept\n", - "sample_text = \"\"\"\n", - "The quick brown fox jumps over the lazy dog. 
This sentence contains all letters of the alphabet.\n", - "It's commonly used for testing fonts and keyboards.\n", - "\"\"\"\n", - "\n", - "evaluation_criteria = {\n", - " \"clarity\": \"How clear and understandable is the text?\",\n", - " \"informativeness\": \"How much useful information does it provide?\",\n", - " \"engagement\": \"How engaging is the content for readers?\"\n", - "}\n", - "\n", - "print(\"Sample Text:\", sample_text)\n", - "print(\"\\nEvaluation Criteria:\")\n", - "for criterion, description in evaluation_criteria.items():\n", - " print(f\"- {criterion.title()}: {description}\")" - ] - }, - { - "cell_type": "code", - "execution_count": 6, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "πŸ€– LLM EVALUATION RESULTS\n", - "\n", - "🎯 Evaluating: Clarity\n", - "----------------------------------------\n", - "LLM Response:\n", - "Score: 9/10\n", - "Reasoning: The text is clear and easy to understand, but it assumes some prior knowledge about the purpose of the sentence. A reader who has never heard of this sentence before might not fully grasp its significance or why it's used for testing fonts and keyboards. However, the language itself is simple and straightforward, making it accessible to a wide range of readers.\n", - "\n", - "🎯 Evaluating: Informativeness\n", - "----------------------------------------\n", - "LLM Response:\n", - "Score: 6/10\n", - "Reasoning: The text provides some useful information about the sentence, specifically its use for testing fonts and keyboards. However, it doesn't provide much depth or context beyond that. It also assumes prior knowledge of why this particular sentence is significant (i.e., containing all letters of the alphabet), which limits its usefulness to readers who are already familiar with this fact.\n", - "\n", - "🎯 Evaluating: Engagement\n", - "----------------------------------------\n", - "LLM Response:\n", - "Score: 2/10\n", - "Reasoning: The content is dry and lacks any narrative or emotional appeal. It's primarily informative, stating a fact about the sentence's composition and its practical application. 
While it may be interesting for those who appreciate linguistic trivia, it's unlikely to engage readers on an emotional level or spark their curiosity in a significant way.\n" - ] - } - ], - "source": [ - "print(\"πŸ€– LLM EVALUATION RESULTS\")\n", - "# Now let's use the LLM to evaluate the text against each criterion\n", - "for criterion, description in evaluation_criteria.items():\n", - " print(f\"\\n🎯 Evaluating: {criterion.title()}\")\n", - " print(\"-\" * 40)\n", - " \n", - " # Create evaluation prompt\n", - " evaluation_prompt = f\"\"\"\n", - "Please evaluate the following text based on this criterion: {description}\n", - "\n", - "Text to evaluate: {sample_text.strip()}\n", - "\n", - "Provide a score from 1-10 and a brief explanation of your reasoning.\n", - "Format your response as:\n", - "Score: X/10\n", - "Reasoning: [Your explanation]\n", - "\"\"\"\n", - " \n", - " # Get LLM evaluation\n", - " try:\n", - " response = llm.invoke(evaluation_prompt)\n", - " print(f\"LLM Response:\\n{response.content}\")\n", - " except Exception as e:\n", - " print(f\"❌ Error getting evaluation: {e}\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Applications Across Domains\n", - "\n", - "### Legal and Judicial Applications\n", - "- **Document Relevance Scoring**: Assess relevance of legal documents to cases\n", - "- **Case Law Analysis**: Evaluate similarity between legal precedents\n", - "- **Judicial Decision Support**: Assist in evidence evaluation and consistency checking\n", - "\n", - "### Content Quality Evaluation\n", - "- **Academic Paper Review**: Automated initial screening of research papers\n", - "- **Content Moderation**: Scale content review for platforms\n", - "- **Customer Service Quality**: Evaluate support interactions\n", - "\n", - "### Conversation Assessment\n", - "- **Chatbot Performance**: Evaluate AI assistant responses\n", - "- **Human-likeness Detection**: Assess naturalness of generated conversations\n", - "- **Training Data Quality**: Validate synthetic conversation datasets" - ] - }, - { - "cell_type": "code", - "execution_count": 7, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "πŸ›οΈ LEGAL DOMAIN EXAMPLE: Document Relevance Scoring\n", - "============================================================\n", - "Case: Personal injury lawsuit: slip and fall at grocery store\n", - "Document: Store surveillance footage showing wet floor conditions on day of incident\n", - "\n", - "πŸ€– LLM Evaluation:\n", - "I would rate the document's relevance to the legal case as a 9 out of 10.\n", - "\n", - "The document is directly related to the incident in question, providing visual evidence of the store's condition at the time of the slip and fall. The footage can be used to:\n", - "\n", - "* Support or refute claims made by the plaintiff about the cause of the accident\n", - "* Show that the store was aware of the wet floor conditions and failed to take adequate measures to address them\n", - "* Demonstrate the extent of the hazard posed by the wet floor\n", - "\n", - "The only reason I wouldn't give it a perfect 10 is that, without more context or analysis, we can't be certain what specific details the footage shows. 
However, in general, store surveillance footage is highly relevant and probative evidence in slip and fall cases like this one.\n", - "\n", - "πŸ’¬ CONVERSATION EXAMPLE: Chatbot Response Quality\n", - "============================================================\n", - "Customer: I ordered a laptop 3 days ago but haven't received shipping confirmation. Can you help?\n", - "Chatbot: Orders usually ship within 5-7 business days. Please wait longer.\n", - "\n", - "πŸ€– LLM Evaluation:\n", - "I would rate the helpfulness of this chatbot response as a 2 out of 10.\n", - "\n", - "The response is unhelpful for several reasons:\n", - "\n", - "* It doesn't acknowledge the customer's concern or frustration about not receiving shipping confirmation.\n", - "* The answer is too vague, stating only that orders \"usually\" ship within 5-7 business days. This doesn't provide any specific information about the status of this particular order.\n", - "* The response essentially tells the customer to wait longer without offering any additional assistance or next steps.\n", - "\n", - "To improve this response, I would suggest the following:\n", - "\n", - "1. Acknowledge the customer's concern: \"Sorry to hear that you haven't received shipping confirmation yet.\"\n", - "2. Provide a more specific answer: \"I've checked on your order and it was shipped out yesterday. You should receive an email with tracking information shortly.\"\n", - "3. Offer additional assistance or next steps: \"If you don't receive the email within the next 24 hours, please let me know and I'll be happy to look into this further.\"\n", - "\n", - "Here's an example of a rewritten response that addresses these issues:\n", - "\n", - "\"Sorry to hear that you haven't received shipping confirmation yet. I've checked on your order and it was shipped out yesterday. You should receive an email with tracking information shortly. If you don't receive the email within the next 24 hours, please let me know and I'll be happy to look into this further.\"\n", - "\n", - "πŸ’‘ Key Takeaways:\n", - "- Legal: Helps prioritize case materials\n", - "- Chatbot: Improves customer service quality\n", - "- All domains need clear evaluation criteria!\n" - ] - } - ], - "source": [ - "# Domain Examples for LLM as Judge\n", - "\n", - "print(\"πŸ›οΈ LEGAL DOMAIN EXAMPLE: Document Relevance Scoring\")\n", - "print(\"=\" * 60)\n", - "\n", - "# Case scenario\n", - "legal_case = \"Personal injury lawsuit: slip and fall at grocery store\"\n", - "sample_document = \"Store surveillance footage showing wet floor conditions on day of incident\"\n", - "\n", - "print(f\"Case: {legal_case}\")\n", - "print(f\"Document: {sample_document}\")\n", - "\n", - "# LLM evaluation\n", - "legal_prompt = f\"\"\"\n", - "Rate this document's relevance to the legal case (1-10 scale):\n", - "\n", - "Case: {legal_case}\n", - "Document: {sample_document}\n", - "\n", - "Provide: Score (1-10) and brief reasoning.\n", - "\"\"\"\n", - "\n", - "try:\n", - " legal_result = llm.invoke(legal_prompt)\n", - " print(f\"\\nπŸ€– LLM Evaluation:\\n{legal_result.content}\")\n", - "except Exception as e:\n", - " print(f\"Error: {e}\")\n", - "\n", - "print(\"\\nπŸ’¬ CONVERSATION EXAMPLE: Chatbot Response Quality\")\n", - "print(\"=\" * 60)\n", - "\n", - "# Customer service scenario\n", - "customer_query = \"I ordered a laptop 3 days ago but haven't received shipping confirmation. Can you help?\"\n", - "chatbot_response = \"Orders usually ship within 5-7 business days. 
Please wait longer.\"\n", - "\n", - "print(f\"Customer: {customer_query}\")\n", - "print(f\"Chatbot: {chatbot_response}\")\n", - "\n", - "# LLM evaluation\n", - "chatbot_prompt = f\"\"\"\n", - "Evaluate this chatbot response for customer service quality:\n", - "\n", - "Customer Query: {customer_query}\n", - "Chatbot Response: {chatbot_response}\n", - "\n", - "Rate helpfulness (1-10) and suggest improvements.\n", - "\"\"\"\n", - "\n", - "try:\n", - " chatbot_result = llm.invoke(chatbot_prompt)\n", - " print(f\"\\nπŸ€– LLM Evaluation:\\n{chatbot_result.content}\")\n", - "except Exception as e:\n", - " print(f\"Error: {e}\")\n", - "\n", - "print(\"\\nπŸ’‘ Key Takeaways:\")\n", - "print(\"- Legal: Helps prioritize case materials\")\n", - "print(\"- Chatbot: Improves customer service quality\")\n", - "print(\"- All domains need clear evaluation criteria!\")" - ] - }, - { - "cell_type": "code", - "execution_count": 8, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "\n", - "πŸ“ CONTENT QUALITY EXAMPLE: Academic Paper Review\n", - "============================================================\n", - "Abstract to review:\n", - "\n", - "We surveyed 10 students about social media and mood. Students using social media \n", - "more than 3 hours daily sometimes felt sad. Therefore, social media is bad for \n", - "all teenagers and should be banned.\n", - "\n", - "\n", - "πŸ€– LLM Review:\n", - "I'd rate this abstract a 2 out of 10 in terms of quality.\n", - "\n", - "Here are the main problems I've identified:\n", - "\n", - "1. **Sample size**: The sample size is extremely small, consisting of only 10 students. This is not sufficient to draw any meaningful conclusions about social media use and mood among teenagers.\n", - "2. **Lack of control group**: There is no comparison group or control condition in this study. How do we know that the students who used social media more than 3 hours a day would have felt sad if they hadn't used social media? A control group would help to establish causality.\n", - "3. **Correlation vs. causation**: The abstract implies that using social media causes sadness, but it's possible that there are other factors at play (e.g., students who use social media more may be more prone to depression or anxiety). Correlational studies like this one can't establish cause-and-effect relationships.\n", - "4. **Overly broad conclusion**: The abstract concludes that \"social media is bad for all teenagers and should be banned.\" This is an overly simplistic and sweeping statement, especially given the small sample size and lack of control group.\n", - "5. **Lack of statistical analysis**: There's no mention of any statistical tests or analyses used to examine the relationship between social media use and mood. This makes it difficult to evaluate the validity of the findings.\n", - "\n", - "Overall, this abstract raises more questions than answers, and its conclusions are likely based on a flawed methodology.\n" - ] - } - ], - "source": [ - "print(\"\\nπŸ“ CONTENT QUALITY EXAMPLE: Academic Paper Review\")\n", - "print(\"=\" * 60)\n", - "\n", - "# Sample abstract with obvious flaws\n", - "paper_abstract = \"\"\"\n", - "We surveyed 10 students about social media and mood. Students using social media \n", - "more than 3 hours daily sometimes felt sad. 
Therefore, social media is bad for \n", - "all teenagers and should be banned.\n", - "\"\"\"\n", - "\n", - "print(f\"Abstract to review:\\n{paper_abstract}\")\n", - "\n", - "# LLM evaluation\n", - "academic_prompt = f\"\"\"\n", - "Review this academic abstract for quality issues:\n", - "\n", - "Abstract: {paper_abstract}\n", - "\n", - "Rate (1-10) and identify main problems with methodology, sample size, or conclusions.\n", - "\"\"\"\n", - "\n", - "try:\n", - " academic_result = llm.invoke(academic_prompt)\n", - " print(f\"\\nπŸ€– LLM Review:\\n{academic_result.content}\")\n", - "except Exception as e:\n", - " print(f\"Error: {e}\")\n" - ] - }, - { - "cell_type": "code", - "execution_count": 9, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "\n", - "πŸ’¬ CONVERSATION EXAMPLE: Chatbot Response Quality\n", - "============================================================\n", - "Customer: I ordered a laptop 3 days ago but haven't received shipping confirmation. Can you help?\n", - "Chatbot: Orders usually ship within 5-7 business days. Please wait longer.\n", - "\n", - "πŸ€– LLM Evaluation:\n", - "I would rate the helpfulness of this chatbot response as a 2 out of 10.\n", - "\n", - "The response is unhelpful for several reasons:\n", - "\n", - "* It doesn't acknowledge the customer's concern or frustration about not receiving shipping confirmation.\n", - "* The answer is too vague, stating only that orders \"usually\" ship within 5-7 business days. This doesn't provide any specific information about the status of this particular order.\n", - "* The response essentially tells the customer to wait longer without offering any additional assistance or next steps.\n", - "\n", - "To improve this response, I would suggest the following:\n", - "\n", - "1. Acknowledge the customer's concern: \"Sorry to hear that you haven't received shipping confirmation yet.\"\n", - "2. Provide a more specific answer: \"I've checked on your order and it was shipped out yesterday. You should receive an email with tracking information shortly.\"\n", - "3. Offer additional assistance or next steps: \"If you don't receive the email within the next 24 hours, please let me know and I'll be happy to look into this further.\"\n", - "\n", - "Here's an example of a rewritten response that addresses these issues:\n", - "\n", - "\"Sorry to hear that you haven't received shipping confirmation yet. I've checked on your order and it was shipped out yesterday. You should receive an email with tracking information shortly. If you don't receive the email within the next 24 hours, please let me know and I'll be happy to look into this further.\"\n", - "\n", - "============================================================\n", - "πŸ’‘ Key Takeaways:\n", - "- Legal: Helps prioritize case materials\n", - "- Academic: Catches obvious methodology flaws\n", - "- Chatbot: Improves customer service quality\n", - "- All domains need clear evaluation criteria!\n" - ] - } - ], - "source": [ - "print(\"\\nπŸ’¬ CONVERSATION EXAMPLE: Chatbot Response Quality\")\n", - "print(\"=\" * 60)\n", - "\n", - "# Customer service scenario\n", - "customer_query = \"I ordered a laptop 3 days ago but haven't received shipping confirmation. Can you help?\"\n", - "chatbot_response = \"Orders usually ship within 5-7 business days. 
Please wait longer.\"\n", - "\n", - "print(f\"Customer: {customer_query}\")\n", - "print(f\"Chatbot: {chatbot_response}\")\n", - "\n", - "# LLM evaluation\n", - "chatbot_prompt = f\"\"\"\n", - "Evaluate this chatbot response for customer service quality:\n", - "\n", - "Customer Query: {customer_query}\n", - "Chatbot Response: {chatbot_response}\n", - "\n", - "Rate helpfulness (1-10) and suggest improvements.\n", - "\"\"\"\n", - "\n", - "try:\n", - " chatbot_result = llm.invoke(chatbot_prompt)\n", - " print(f\"\\nπŸ€– LLM Evaluation:\\n{chatbot_result.content}\")\n", - "except Exception as e:\n", - " print(f\"Error: {e}\")\n", - "\n", - "print(\"\\n\" + \"=\" * 60)\n", - "print(\"πŸ’‘ Key Takeaways:\")\n", - "print(\"- Legal: Helps prioritize case materials\")\n", - "print(\"- Academic: Catches obvious methodology flaws\") \n", - "print(\"- Chatbot: Improves customer service quality\")\n", - "print(\"- All domains need clear evaluation criteria!\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## The Journey Most People Take (And Why It's Problematic)\n", - "\n", - "Most people start with LLM judging the same way:\n", - "1. **\"Just ask if it's correct\"** - Seems obvious, what could go wrong?\n", - "2. **\"Ask for true/false\"** - More structured, feels better\n", - "3. **\"Give it a score\"** - Numbers feel objective and scientific\n", - "4. **\"Compare two options\"** - Let the LLM pick the better one\n", - "\n", - "**Spoiler**: Each approach has serious hidden flaws that most people never discover.\n", - "\n", - "Let's experience this journey together, starting with the most naive approach..." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Naive Approach 1: \"Just Tell Me If This Answer Is Correct\"\n", - "\n", - "This is how everyone starts. It seems so simple and obvious:\n", - "- Give the LLM a question and an answer\n", - "- Ask \"Is this answer correct?\"\n", - "- Trust the yes/no response\n", - "\n", - "### What Could Possibly Go Wrong?\n", - "Let's find out using carefully chosen examples..." - ] - }, - { - "cell_type": "code", - "execution_count": 16, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "\n", - "### Naive Approach 1: 'Just Tell Me If This Answer Is Correct'\n", - "This method fails spectacularly when the LLM is presented with information that is both plausible and incorrect. The LLM may lack the internal process to critically verify a statement that it is presented with as fact, especially if the answer is short and lacks context.\n", - "\n", - "User Question: What programming language should beginners learn first?\n", - "Model Answer: Python is an excellent choice for beginners because it has clean, readable syntax and a gentle learning curve. Most computer science courses and coding bootcamps start with Python.\n", - "LLM Judge Prompt:\n", - "\n", - "Is the given answer correct? Only answer with Yes or No.\n", - "Question: 'What programming language should beginners learn first?'\n", - "Answer: 'Python is an excellent choice for beginners because it has clean, readable syntax and a gentle learning curve. 
Most computer science courses and coding bootcamps start with Python.'\n", - "\n", - "LLM Response: content='Yes' additional_kwargs={} response_metadata={'model': 'llama3.1:8b', 'created_at': '2025-09-17T05:43:50.950278Z', 'done': True, 'done_reason': 'stop', 'total_duration': 1307772584, 'load_duration': 886591917, 'prompt_eval_count': 71, 'prompt_eval_duration': 399741583, 'eval_count': 2, 'eval_duration': 20468667, 'model_name': 'llama3.1:8b'} id='run--4c59488c-c22e-4ea3-b3cb-059760c9bf78-0' usage_metadata={'input_tokens': 71, 'output_tokens': 2, 'total_tokens': 73}\n", - "\n", - "**What goes wrong:** The LLM will often agree even if the answer is subjective and other equally valid answers exist. It can confidently state 'yes' to an opinion presented as fact, making the output seem reliable when it is not universally true.\n", - "**Hidden Flaw:** The LLM's confidence is not a reliable indicator of correctness when the question itself is subjective. Confident, research-backed language can trick the LLM into thinking advice is factual rather than contextual.\n" - ] - } - ], - "source": [ - "print(\"\\n### Naive Approach 1: 'Just Tell Me If This Answer Is Correct'\")\n", - "print(\"This method fails spectacularly when the LLM is presented with information that is both plausible and incorrect. The LLM may lack the internal process to critically verify a statement that it is presented with as fact, especially if the answer is short and lacks context.\")\n", - "\n", - "# User's example details\n", - "user_question_1 = \"What programming language should beginners learn first?\"\n", - "model_answer_1 = \"Python is an excellent choice for beginners because it has clean, readable syntax and a gentle learning curve. Most computer science courses and coding bootcamps start with Python.\"\n", - "llm_judge_prompt_1 = f\"\"\"\n", - "Is the given answer correct? Only answer with Yes or No.\n", - "Question: '{user_question_1}'\n", - "Answer: '{model_answer_1}'\n", - "\"\"\"\n", - "\n", - "print(f\"\\nUser Question: {user_question_1}\")\n", - "print(f\"Model Answer: {model_answer_1}\")\n", - "print(f\"LLM Judge Prompt:\\n{llm_judge_prompt_1}\")\n", - "\n", - "# Simulating LLM response based on the problem description\n", - "# In a real scenario, llm.invoke(llm_judge_prompt_2) would be called.\n", - "# The problem description states: \"LLM likely says YES because answer sounds authoritative and mentions 'most courses' - mistaking common practice for universal truth.\"\n", - "response_1_content = llm.invoke(llm_judge_prompt_1)\n", - "response_1 = type('obj', (object,), {'content': response_1_content})() # Mocking the response object\n", - "print(f\"LLM Response: {response_1.content}\")\n", - "\n", - "print(\"\\n**What goes wrong:** The LLM will often agree even if the answer is subjective and other equally valid answers exist. It can confidently state 'yes' to an opinion presented as fact, making the output seem reliable when it is not universally true.\")\n", - "print(\"**Hidden Flaw:** The LLM's confidence is not a reliable indicator of correctness when the question itself is subjective. 
Confident, research-backed language can trick the LLM into thinking advice is factual rather than contextual.\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Naive Approach 2: True/False Classification\n", - "\n", - "After discovering issues with simple correctness, people often move to true/false evaluation:\n", - "- Seems more structured and binary\n", - "- Feels more \"scientific\" than yes/no\n", - "- But loses important nuance..." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "\n", - "### Naive Approach 2: 'True/False with Nuanced Claims'\n", - "This method fails when the LLM is asked to evaluate a nuanced claim as a simple true or false statement, even with context. The LLM may lack the ability to acknowledge the complexities, exceptions, or varying degrees of truth within a statement, leading to an oversimplified 'True' or 'False' response that misses critical subtleties.\n", - "\n", - "User Question: Is the following statement true or false given the context?\n", - "Statement being evaluated (Model Answer): Exercise is good for mental health\n", - "Context provided: Regular moderate exercise has been shown in numerous studies to reduce symptoms of depression and anxiety.\n", - "LLM Judge Prompt:\n", - "\n", - "Is the following statement true or false given the context? Return only True or False.\n", - "Statement: 'Exercise is good for mental health'\n", - "Context: 'Regular moderate exercise has been shown in numerous studies to reduce symptoms of depression and anxiety.'\n", - "\n", - "LLM Response: content='True.' additional_kwargs={} response_metadata={'model': 'llama3.1:8b', 'created_at': '2025-09-17T05:44:03.94595Z', 'done': True, 'done_reason': 'stop', 'total_duration': 432792000, 'load_duration': 74965458, 'prompt_eval_count': 57, 'prompt_eval_duration': 314256000, 'eval_count': 3, 'eval_duration': 42852458, 'model_name': 'llama3.1:8b'} id='run--aaa08250-df9a-4fde-af32-16148e2dca89-0' usage_metadata={'input_tokens': 57, 'output_tokens': 3, 'total_tokens': 60}\n", - "\n", - "**What goes wrong:** The LLM will likely respond 'True' because the statement is broadly accepted, despite the significant nuances and exceptions. It oversimplifies a complex topic into a binary answer.\n", - "**Hidden Flaw:** The LLM's binary 'True/False' judgment fails to capture the conditional nature or limitations of the claim. It struggles with statements that are 'mostly true' but not universally or unconditionally true, especially when the context provided supports the general truth without elaborating on exceptions.\n" - ] - } - ], - "source": [ - "print(\"\\n### Naive Approach 2: 'True/False with Nuanced Claims'\")\n", - "print(\"This method fails when the LLM is asked to evaluate a nuanced claim as a simple true or false statement, even with context. 
The LLM may lack the ability to acknowledge the complexities, exceptions, or varying degrees of truth within a statement, leading to an oversimplified 'True' or 'False' response that misses critical subtleties.\")\n", - "\n", - "# User's example details\n", - "user_question_2 = \"Is the following statement true or false given the context?\"\n", - "model_answer_2 = \"Exercise is good for mental health\" # This is the statement being evaluated\n", - "context_2 = \"Regular moderate exercise has been shown in numerous studies to reduce symptoms of depression and anxiety.\"\n", - "llm_judge_prompt_2 = f\"\"\"\n", - "Is the following statement true or false given the context? Return only True or False.\n", - "Statement: '{model_answer_2}'\n", - "Context: '{context_2}'\n", - "\"\"\"\n", - "\n", - "print(f\"\\nUser Question: {user_question_2}\")\n", - "print(f\"Statement being evaluated (Model Answer): {model_answer_2}\")\n", - "print(f\"Context provided: {context_2}\")\n", - "print(f\"LLM Judge Prompt:\\n{llm_judge_prompt_2}\")\n", - "\n", - "# Call the LLM judge. It will likely respond 'True' because the statement is generally\n", - "# accepted, even though it ignores individual variation, severity of conditions, and the\n", - "# fact that exercise alone isn't sufficient for serious mental health issues.\n", - "# Printing the full message object also shows the response metadata.\n", - "response_2 = llm.invoke(llm_judge_prompt_2)\n", - "print(f\"LLM Response: {response_2}\")\n", - "\n", - "print(\"\\n**What goes wrong:** The LLM will likely respond 'True' because the statement is broadly accepted, despite the significant nuances and exceptions. It oversimplifies a complex topic into a binary answer.\")\n", - "print(\"**Hidden Flaw:** The LLM's binary 'True/False' judgment fails to capture the conditional nature or limitations of the claim. It struggles with statements that are 'mostly true' but not universally or unconditionally true, especially when the context provided supports the general truth without elaborating on exceptions.\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Naive Approach 3: Direct Scoring\n", - "\n", - "When true/false feels too limiting, people turn to scoring:\n", - "- \"Numbers are objective!\"\n", - "- \"1-10 scale feels scientific\"\n", - "- But without clear criteria, scores become arbitrary..." - ] - }, - { - "cell_type": "code", - "execution_count": 18, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "\n", - "### Naive Approach 3: Direct Scoring\n", - "A score can be completely arbitrary without a clear definition of what each number represents. A partially correct answer or even a hallucinated answer might receive a surprisingly high score if the LLM is designed to be helpful rather than strictly accurate.\n", - "\n", - "User Question: Explain how photosynthesis works\n", - "Model Answer: Plants use sunlight to make food. Chlorophyll in leaves absorbs light and converts carbon dioxide and water into glucose and oxygen.\n", - "LLM Judge Prompt:\n", - "\n", - "Rate this answer from 1-10 for quality. Provide the number, then explain your reasoning.\n", - "Question: 'Explain how photosynthesis works'\n", - "Answer: 'Plants use sunlight to make food. 
Chlorophyll in leaves absorbs light and converts carbon dioxide and water into glucose and oxygen.'\n", - "\n", - "LLM Response (Example Run 1): content='8\\n\\nThis answer provides a clear and concise explanation of the basic process of photosynthesis, including the key components involved (chlorophyll, sunlight, CO2, H2O, glucose, and O2). However, it lacks detail and does not mention the overall equation for photosynthesis or the role of light-dependent reactions.' additional_kwargs={} response_metadata={'model': 'llama3.1:8b', 'created_at': '2025-09-17T05:45:08.176698Z', 'done': True, 'done_reason': 'stop', 'total_duration': 1674546042, 'load_duration': 76024167, 'prompt_eval_count': 69, 'prompt_eval_duration': 308406792, 'eval_count': 67, 'eval_duration': 1289491833, 'model_name': 'llama3.1:8b'} id='run--f9fb30a3-81a1-465d-9535-53a242ab2cbf-0' usage_metadata={'input_tokens': 69, 'output_tokens': 67, 'total_tokens': 136}\n", - "\n", - "**Hidden Flaw:** The LLM lacks a transparent and consistently applied internal rubric for 'quality'. Without explicit criteria provided in the prompt, its scoring becomes arbitrary, reflecting internal stochasticity rather than a stable evaluation of the answer's merit. In short, no clear criteria means arbitrary scoring.\n" - ] - } - ], - "source": [ - "print(\"\\n### Naive Approach 3: Direct Scoring\")\n", - "print(\"A score can be completely arbitrary without a clear definition of what each number represents. A partially correct answer or even a hallucinated answer might receive a surprisingly high score if the LLM is designed to be helpful rather than strictly accurate.\")\n", - "# User's example details\n", - "user_question_3 = \"Explain how photosynthesis works\"\n", - "model_answer_3 = \"Plants use sunlight to make food. Chlorophyll in leaves absorbs light and converts carbon dioxide and water into glucose and oxygen.\"\n", - "llm_judge_prompt_3 = f\"\"\"\n", - "Rate this answer from 1-10 for quality. Provide the number, then explain your reasoning.\n", - "Question: '{user_question_3}'\n", - "Answer: '{model_answer_3}'\n", - "\"\"\"\n", - "\n", - "print(f\"\\nUser Question: {user_question_3}\")\n", - "print(f\"Model Answer: {model_answer_3}\")\n", - "print(f\"LLM Judge Prompt:\\n{llm_judge_prompt_3}\")\n", - "\n", - "# Call the LLM judge. Because the prompt defines no rubric, repeated runs can return\n", - "# different scores (e.g. 7, 8, or 9) for the same answer.\n", - "# Printing the full message object also shows the response metadata.\n", - "response_3 = llm.invoke(llm_judge_prompt_3)\n", - "print(f\"LLM Response (Example Run 1): {response_3}\")\n", - "\n", - "# Re-run this cell a few times to see the inconsistency for yourself, e.g.:\n", - "# print(f\"LLM Response (Example Run 2): {llm.invoke(llm_judge_prompt_3).content}\")\n", - "\n", - "print(\"\\n**Hidden Flaw:** The LLM lacks a transparent and consistently applied internal rubric for 'quality'. Without explicit criteria provided in the prompt, its scoring becomes arbitrary, reflecting internal stochasticity rather than a stable evaluation of the answer's merit. 
In short, no clear criteria means arbitrary scoring.\")\n", - "\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Naive Approach 4: Compare two options\n", - "\n", - "When direct scoring proves too arbitrary and inconsistent, people often pivot to comparing options:\n", - "- \"Let the LLM pick the better one!\"\n", - "- \"It's how humans often evaluate choices, so it must be good.\"\n", - "- But this method is still susceptible to subtle biases that can skew the results..." - ] - }, - { - "cell_type": "code", - "execution_count": 15, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "\n", - "### Naive Approach 4: Compare two options\n", - "This method, popular in preference fine-tuning, is still susceptible to several biases, including position bias and verbosity bias. An LLM might prefer a response based on its length or its position in the prompt rather than its actual quality.\n", - "\n", - "User Question: Describe the city of New York.\n", - "Model Answer A: New York City is a global hub for finance, culture, and media. It is the most populous city in the United States and is home to iconic landmarks like the Statue of Liberty and Times Square. Its diverse neighborhoods and bustling atmosphere make it a unique and dynamic place to visit.\n", - "Model Answer B: New York City is a place with lots of tall buildings and famous spots. It has many different people from all over the world. It can be a very busy and exciting place to be.\n", - "LLM Judge Prompt (A first):\n", - "Which of the two following answers is better?\n", - "Answer A: 'New York City is a global hub for finance, culture, and media. It is the most populous city in the United States and is home to iconic landmarks like the Statue of Liberty and Times Square. Its diverse neighborhoods and bustling atmosphere make it a unique and dynamic place to visit.'\n", - "Answer B: 'New York City is a place with lots of tall buildings and famous spots. It has many different people from all over the world. It can be a very busy and exciting place to be.'\n", - "LLM Response (A first): After analyzing both answers, I would say that **Answer A** is better for several reasons:\n", - "\n", - "1. **Specificity**: Answer A provides specific examples of New York City's significance (finance, culture, media) and iconic landmarks (Statue of Liberty, Times Square), making it more informative and engaging.\n", - "2. **Vivid language**: The use of words like \"global hub,\" \"iconic,\" \"diverse neighborhoods,\" and \"bustling atmosphere\" creates a richer and more immersive experience for the reader.\n", - "3. **Clear structure**: Answer A follows a logical structure, starting with an overview of New York City's significance and then highlighting its notable features.\n", - "4. **Engagement**: The description in Answer A is likely to pique the interest of readers who are interested in travel or learning about new places.\n", - "\n", - "In contrast, Answer B:\n", - "\n", - "1. **Lacks specificity**: It uses vague terms like \"lots of tall buildings\" and \"famous spots,\" which don't provide much insight into what makes New York City unique.\n", - "2. **Uses simple language**: While simplicity can be beneficial for some audiences, it doesn't add depth or interest to the description in this case.\n", - "3. 
**Fails to engage**: The description is more generic and less likely to capture the reader's attention.\n", - "\n", - "Overall, Answer A provides a more detailed, engaging, and informative description of New York City, making it the better choice.\n", - "\n", - "**Hidden flaw (Verbosity Bias):** The judge will almost certainly favor the longer, more verbose Answer A, associating greater length with higher quality, even though Answer B is not necessarily wrong or inadequate for certain contexts. The LLM's response often reflects this preference by citing more detail or sophisticated language.\n", - "\n", - "LLM Judge Prompt (B first):\n", - "Which of the two following answers is better?\n", - "Answer A: 'New York City is a place with lots of tall buildings and famous spots. It has many different people from all over the world. It can be a very busy and exciting place to be.'\n", - "Answer B: 'New York City is a global hub for finance, culture, and media. It is the most populous city in the United States and is home to iconic landmarks like the Statue of Liberty and Times Square. Its diverse neighborhoods and bustling atmosphere make it a unique and dynamic place to visit.'\n", - "LLM Response (B first): Answer B is significantly better than Answer A for several reasons:\n", - "\n", - "1. **Specificity**: Answer B provides specific details about New York City, such as its status as a global hub for finance, culture, and media, which gives the reader a clearer understanding of what the city has to offer.\n", - "2. **Accuracy**: The information in Answer B is more accurate and up-to-date compared to Answer A. For example, it correctly identifies New York City as the most populous city in the United States (although this may change over time).\n", - "3. **Organization**: Answer B presents its information in a clear and organized manner, making it easier for the reader to follow.\n", - "4. **Style**: The language used in Answer B is more formal and polished than Answer A, which makes it more suitable for academic or professional writing.\n", - "\n", - "Answer A, on the other hand, is more general and lacks specific details about New York City. It also uses vague terms like \"lots of tall buildings\" and \"famous spots,\" which don't provide much insight into what makes the city unique.\n", - "\n", - "Overall, Answer B is a better choice because it provides more accurate, specific, and organized information that gives the reader a clearer understanding of New York City's characteristics.\n", - "\n", - "**Hidden flaw (Position Bias):** If the order of the answers were swapped (as demonstrated in the second prompt), there is a chance the LLM would favor the new 'Answer A' (which is now the less verbose one), demonstrating an unconscious preference for the first item presented. This bias is particularly problematic for instances where the answers are of similar quality, as the LLM's preference can be swayed by the arbitrary ordering.\n" - ] - } - ], - "source": [ - "print(\"\\n### Naive Approach 4: Compare two options\")\n", - "print(\"This method, popular in preference fine-tuning, is still susceptible to several biases, including position bias and verbosity bias. 
An LLM might prefer a response based on its length or its position in the prompt rather than its actual quality.\")\n", - "\n", - "# User's example details\n", - "user_question_4 = \"Describe the city of New York.\"\n", - "model_answer_A_4 = \"New York City is a global hub for finance, culture, and media. It is the most populous city in the United States and is home to iconic landmarks like the Statue of Liberty and Times Square. Its diverse neighborhoods and bustling atmosphere make it a unique and dynamic place to visit.\"\n", - "model_answer_B_4 = \"New York City is a place with lots of tall buildings and famous spots. It has many different people from all over the world. It can be a very busy and exciting place to be.\"\n", - "\n", - "# Prompt for Verbosity Bias (A is more verbose)\n", - "llm_judge_prompt_4_verbosity = f\"\"\"Which of the two following answers is better?\n", - "Answer A: '{model_answer_A_4}'\n", - "Answer B: '{model_answer_B_4}'\"\"\"\n", - "\n", - "print(f\"\\nUser Question: {user_question_4}\")\n", - "print(f\"Model Answer A: {model_answer_A_4}\")\n", - "print(f\"Model Answer B: {model_answer_B_4}\")\n", - "print(f\"LLM Judge Prompt (A first):\\n{llm_judge_prompt_4_verbosity}\")\n", - "\n", - "response_4_verbosity = llm.invoke(llm_judge_prompt_4_verbosity)\n", - "print(f\"LLM Response (A first): {response_4_verbosity.content}\")\n", - "\n", - "print(\"\\n**Hidden flaw (Verbosity Bias):** The judge will almost certainly favor the longer, more verbose Answer A, associating greater length with higher quality, even though Answer B is not necessarily wrong or inadequate for certain contexts. The LLM's response often reflects this preference by citing more detail or sophisticated language.\")\n", - "\n", - "# Prompt for Position Bias (swapping A and B)\n", - "llm_judge_prompt_4_position = f\"\"\"Which of the two following answers is better?\n", - "Answer A: '{model_answer_B_4}'\n", - "Answer B: '{model_answer_A_4}'\"\"\" # Swapped order\n", - "\n", - "print(f\"\\nLLM Judge Prompt (B first):\\n{llm_judge_prompt_4_position}\")\n", - "\n", - "response_4_position = llm.invoke(llm_judge_prompt_4_position)\n", - "print(f\"LLM Response (B first): {response_4_position.content}\")\n", - "\n", - "print(\"\\n**Hidden flaw (Position Bias):** If the order of the answers were swapped (as demonstrated in the second prompt), there is a chance the LLM would favor the new 'Answer A' (which is now the less verbose one), demonstrating an unconscious preference for the first item presented. This bias is particularly problematic for instances where the answers are of similar quality, as the LLM's preference can be swayed by the arbitrary ordering.\")\n", - "\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Summary: The Path Forward\n", - "\n", - "### What We've Learned:\n", - "\n", - "**The Problem:**\n", - "1. **Traditional evaluation methods** don't work for modern AI systems\n", - "2. **AI models fail silently** without proper evaluation\n", - "3. **Naive LLM judging approaches** have hidden flaws\n", - "\n", - "**The Solution:**\n", - "1. **LLM as Judge** provides scalable, nuanced evaluation\n", - "2. **Proper implementation** requires systematic approaches\n", - "3. 
**Success measurement** needs concrete metrics and bias detection\n", - "\n", - "### Next Steps:\n", - "In the following sessions, we'll build sophisticated solutions:\n", - "- **Session 2**: Progressive improvements and structured approaches\n", - "- **Session 3**: Production-ready systems with bias detection\n", - "- **Final Challenge**: Building comprehensive evaluation pipelines\n", - "\n", - "---\n", - "\n", - "**You now understand both the promise and perils of LLM as Judge systems. Ready to build better solutions?**" - ] - } - ], - "metadata": { - "kernelspec": { - "display_name": "qna_gen", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.12.2" - } - }, - "nbformat": 4, - "nbformat_minor": 4 -}