# Advanced Prompt Engineering Techniques

This notebook covers advanced prompt engineering techniques beyond the basics. You'll learn reasoning enhancement patterns, iterative refinement methods, efficiency optimizations, and systematic prompt improvement workflows.

## Learning Objectives

By the end of this notebook, you will be able to:
- Implement **reasoning enhancement** patterns like Chain-of-Thought and Extended Thinking
- Apply **iterative refinement** methods such as Self-Refine and Chain-of-Verification
- Optimize for **efficiency** with Chain-of-Draft to reduce output tokens and latency
- Generate **diverse outputs** with Verbalized Sampling techniques
- Apply systematic **prompt optimization workflows** using manual iteration, LLM-assisted exploration, and automated evaluation

## Why This Matters

Advanced prompt engineering techniques can dramatically improve output quality for complex tasks:
- **Chain-of-Thought** can improve accuracy on reasoning tasks by 40-60%, though gains vary by model and task
- **Extended Thinking** enables deep analysis for complex problems
- **Systematic optimization workflows** help you find the best prompt variant without guesswork

**Duration**: ~60-90 minutes

## Prerequisites

Before running this notebook, ensure you have:
1. An AWS account with Amazon Bedrock access enabled
2. AWS credentials configured (via `.env` file, AWS CLI, or IAM role)

## Setup

In [1]:
from __future__ import annotations

from dotenv import load_dotenv

load_dotenv()

import json
import os
from collections import Counter

import boto3

# Initialize Bedrock clients
REGION = os.getenv("AWS_DEFAULT_REGION", "us-east-1")
bedrock_runtime = boto3.client("bedrock-runtime", region_name=REGION)
bedrock_agent_runtime = boto3.client("bedrock-agent-runtime", region_name=REGION)

# Model configurations
MODEL_SONNET = "global.anthropic.claude-sonnet-4-5-20250929-v1:0"
MODEL_HAIKU = "global.anthropic.claude-haiku-4-5-20251001-v1:0"
MODEL_HAIKU_35 = "us.anthropic.claude-3-5-haiku-20241022-v1:0"
MODEL_ID = MODEL_HAIKU


def call_bedrock(prompt, system=None, temperature=0.0, max_tokens=1000, model_id=None):
    """
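    Call the Bedrock Converse API with a single user message.

    Returns a dict with the response text, the input and output token
    counts, and the latency in milliseconds reported by Bedrock.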
    """
    model = model_id or MODEL_ID
    messages = [{"role": "user", "content": [{"text": prompt}]}]

    kwargs = {
        "modelId": model,
        "messages": messages,
        "inferenceConfig": {"maxTokens": max_tokens, "temperature": temperature},
    }

    if system:
        kwargs["system"] = [{"text": system}]

    response = bedrock_runtime.converse(**kwargs)

    return {
        "text": response["output"]["message"]["content"][0]["text"],
        "input_tokens": response["usage"]["inputTokens"],
        "output_tokens": response["usage"]["outputTokens"],
        "latency_ms": response["metrics"]["latencyMs"],
    }


print(f"Region: {REGION}")
print(f"Default Model: {MODEL_ID}")
print("\nSetup complete!")

Region: us-east-1
Default Model: global.anthropic.claude-haiku-4-5-20251001-v1:0

Setup complete!


<div class="alert alert-block alert-info">
<b>Note:</b> This notebook uses Claude models with the <b>global</b> cross-Region inference (CRIS) profile for higher availability and ~10% cost savings.
</div>

---

# Category 1: Foundation Techniques

Foundation techniques are the building blocks of effective prompts. They are covered in detail in Sections 2 and 3 of [02-optimization-strategy.ipynb](../01-basics/02-optimization-strategy.ipynb):

- **Clear Instructions**: Specific, unambiguous directions
- **Few-Shot Examples**: 2-5 input/output examples
- **Structured Output**: JSON/XML format specifications
- **Parameter Tuning**: Temperature, max_tokens optimization

<div class="alert alert-block alert-info">
<b>Note:</b> If you haven't completed the <code>01-basics/</code> notebooks, we recommend reviewing them first. The techniques covered there are prerequisites for the advanced techniques in this notebook.
</div>

**Quick refresher on when to apply foundation techniques:**
- Use **clear instructions** on every prompt (specify format, length, constraints)
- Add **few-shot examples** when you need consistent output format
- Use **structured output** (JSON/tool use) for programmatic parsing
- Set **temperature=0** for factual tasks, higher for creative tasks
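
As a quick illustration of the few-shot pattern, the sketch below assembles an instruction, a handful of labeled examples, and a new input into one prompt. The function name `build_few_shot_prompt` and the `Input:`/`Output:` labels are illustrative choices, not something defined in the `01-basics/` notebooks.

```python
def build_few_shot_prompt(instruction, examples, query):
    """Assemble an instruction, labeled examples, and the new input into one prompt."""
    parts = [instruction.strip(), ""]
    for example_input, example_output in examples:
        parts.append(f"Input: {example_input}")
        parts.append(f"Output: {example_output}")
        parts.append("")  # blank line between examples
    # End with an open "Output:" so the model completes in the same format
    parts.append(f"Input: {query}")
    parts.append("Output:")
    return "\n".join(parts)


prompt = build_few_shot_prompt(
    "Classify the sentiment as POSITIVE or NEGATIVE.",
    [("Love it!", "POSITIVE"), ("Broke after one day.", "NEGATIVE")],
    "Arrived late and damaged.",
)
print(prompt)
```

The resulting string can be passed to `call_bedrock` unchanged; ending on an open `Output:` line nudges the model to answer in the same format as the examples.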

---

# Category 2: Reasoning Enhancement

Reasoning enhancement techniques help models tackle complex, multi-step problems more reliably. These techniques are essential when tasks require logical deduction, mathematical calculations, or careful analysis.

| Technique | When to Use | Trade-off |
|-----------|-------------|-----------|
| **Chain-of-Thought (CoT)** | Math, logic, multi-step reasoning | +tokens, +latency |
| **Extended Thinking** | Complex analysis, coding, research | +tokens (thinking budget) |

## 2.1 Chain-of-Thought (CoT)

Chain-of-Thought prompting asks the model to show its reasoning before giving an answer. It is most useful on problems where answering directly often fails, such as counting, arithmetic, and multi-step logic.

**How it works:**
1. Prompt explicitly requests step-by-step reasoning
2. Model generates intermediate steps
3. Final answer is derived from the reasoning chain

<div class="alert alert-block alert-warning">
<b>Trade-off:</b> CoT increases output tokens and latency. Use it when accuracy matters more than speed.
</div>

In [2]:
# Demo: Where direct answer fails but CoT succeeds
print("=" * 70)
print("DEMO: Chain-of-Thought vs Direct Answer (Using Haiku 3.5)")
print("=" * 70)

# Letter counting is notoriously hard for older LLMs without explicit reasoning
tricky_problem = "How many times does the letter 'r' appear in the word 'strawberry'?"

# Direct prompt - models often miscount
direct_prompt = f"{tricky_problem}\n\nAnswer with just the number."

# Chain-of-Thought prompt
cot_prompt = f"""{tricky_problem}

Think step by step: list each letter with its position, identify each 'r', then count them."""

print(f"\nProblem: {tricky_problem}")
print("\n" + "-" * 70)

print("\n--- Direct Answer (Haiku 3.5) ---")
direct_response = call_bedrock(direct_prompt, model_id=MODEL_HAIKU_35)
print(f"Response: {direct_response['text']}")
print(f"Tokens: {direct_response['output_tokens']}")

print("\n--- Chain-of-Thought (Haiku 3.5) ---")
cot_response = call_bedrock(cot_prompt, model_id=MODEL_HAIKU_35)
print(f"Response:\n{cot_response['text']}")
print(f"\nTokens: {cot_response['output_tokens']}")

print("\n" + "=" * 70)
print("Correct answer: 3 (st-r-awbe-r-r-y: positions 3, 8, 9)")
print("=" * 70)

DEMO: Chain-of-Thought vs Direct Answer (Using Haiku 3.5)

Problem: How many times does the letter 'r' appear in the word 'strawberry'?

----------------------------------------------------------------------

--- Direct Answer (Haiku 3.5) ---
Response: 2
Tokens: 5

--- Chain-of-Thought (Haiku 3.5) ---
Response:
Let's solve this step by step:

1. Write out the word: strawberry

2. List each letter with its position:
   s (1st)
   t (2nd)
   r (3rd) - First 'r'
   a (4th)
   w (5th)
   b (6th)
   e (7th)
   r (8th) - Second 'r'
   r (9th) - Third 'r'
   y (10th)

3. Count the number of 'r's:
   There are 3 'r's in the word 'strawberry'

The answer is 3.

Tokens: 156

Correct answer: 3 (st-r-awbe-r-r-y: positions 3, 8, 9)
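
The expected answer is easy to confirm deterministically, which is part of what makes letter counting a convenient CoT test case. A minimal check in plain Python (no model call), using 1-based positions to match the reasoning trace above:

```python
word = "strawberry"
target = "r"

# 1-based positions, matching the numbering in the model's reasoning trace
positions = [i for i, letter in enumerate(word, start=1) if letter == target]

print(f"Positions of {target!r}: {positions}")
print(f"Count: {len(positions)}")
```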


## 2.2 Extended Thinking (Thinking Tokens)

Rather than prompting the model to "think step by step", you can enable **extended thinking** on supported Claude models. This native feature gives the model a dedicated "thinking budget" for internal reasoning.

**Key differences from explicit CoT:**
- Reasoning happens in a separate `thinking` block (not mixed with output)
- Model can explore multiple approaches before committing
- You are billed for all thinking tokens, but the final answer stays free of reasoning text

| Model | Thinking Output | Notes |
|-------|-----------------|-------|
| Claude 3.7 Sonnet | Full thinking output | See all reasoning |
| Claude 4+ (Sonnet, Opus, Haiku) | Summarized thinking | Billed for full, see summary |

<div></div>

| Parameter | Description |
|-----------|-------------|
| `budget_tokens` | Maximum tokens for thinking (minimum 1,024) |
| `type` | Set to `"enabled"` to activate extended thinking |

<div class="alert alert-block alert-warning">
<b>Token Cost:</b> You are billed for the full thinking tokens generated, not just the summary. Claude 4+ models return summarized thinking but bill for the full amount.
</div>
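
The two parameters above fit into the request body as sketched below. The helper name `build_thinking_request` is hypothetical, and the check that `budget_tokens` stays below `max_tokens` is an assumption on my part (the thinking budget is taken to count against the overall output limit); only the 1,024 minimum comes from the table.

```python
import json

MIN_THINKING_BUDGET = 1024  # minimum budget noted in the parameter table


def build_thinking_request(prompt, budget_tokens=2048, max_tokens=4000):
    """Build an InvokeModel request body with extended thinking enabled."""
    if budget_tokens < MIN_THINKING_BUDGET:
        raise ValueError(f"budget_tokens must be at least {MIN_THINKING_BUDGET}")
    if budget_tokens >= max_tokens:
        # Assumption: thinking counts against max_tokens, so leave room for the answer
        raise ValueError("budget_tokens must be smaller than max_tokens")
    return {
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": max_tokens,
        "thinking": {"type": "enabled", "budget_tokens": budget_tokens},
        "messages": [{"role": "user", "content": prompt}],
    }


body = build_thinking_request("What is 17 * 23?", budget_tokens=1024, max_tokens=2000)
print(json.dumps(body["thinking"]))
```

Validating the budget before the call surfaces configuration mistakes locally instead of as an API error.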

In [3]:
# Demo: Extended Thinking using InvokeModel API
print("=" * 70)
print("DEMO: Extended Thinking (Thinking Tokens)")
print("=" * 70)

# Complex problem that benefits from extended thinking
complex_problem = """A farmer has a fox, a chicken, and a bag of grain. He needs to cross a river 
in a boat that can only carry him and one item at a time. If left alone:
- The fox will eat the chicken
- The chicken will eat the grain

How can the farmer get all three across safely? List the minimum steps."""


def call_with_thinking(prompt, budget_tokens=2048, max_tokens=4000):
    """
    Call Claude with extended thinking enabled via InvokeModel API.
    """
    request_body = {
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": max_tokens,
        "thinking": {"type": "enabled", "budget_tokens": budget_tokens},
        "messages": [{"role": "user", "content": prompt}],
    }

    response = bedrock_runtime.invoke_model(modelId=MODEL_SONNET, body=json.dumps(request_body))

    result = json.loads(response["body"].read())

    # Extract thinking and text blocks
    thinking_text = None
    answer_text = None

    for block in result.get("content", []):
        if block.get("type") == "thinking":
            thinking_text = block.get("thinking", "")
        elif block.get("type") == "text":
            answer_text = block.get("text", "")

    return {"thinking": thinking_text, "answer": answer_text, "usage": result.get("usage", {})}


print(f"\nProblem: {complex_problem}")
print("\n" + "-" * 70)

result = call_with_thinking(complex_problem, budget_tokens=2048)

print("\n--- Thinking (Summarized) ---")
if result["thinking"]:
    thinking_preview = result["thinking"][:500]
    print(thinking_preview + "..." if len(result["thinking"]) > 500 else thinking_preview)

print("\n--- Final Answer ---")
print(result["answer"])

print("\n--- Raw Token Usage ---")
print(json.dumps(result["usage"], indent=2))

print("\n" + "=" * 70)

DEMO: Extended Thinking (Thinking Tokens)

Problem: A farmer has a fox, a chicken, and a bag of grain. He needs to cross a river 
in a boat that can only carry him and one item at a time. If left alone:
- The fox will eat the chicken
- The chicken will eat the grain

How can the farmer get all three across safely? List the minimum steps.

----------------------------------------------------------------------

--- Thinking (Summarized) ---
This is a classic river crossing puzzle. Let me work through it step by step.

Starting position:
- Left bank: Farmer, Fox, Chicken, Grain
- Right bank: (empty)

Constraints:
- Can't leave Fox and Chicken alone together (fox eats chicken)
- Can't leave Chicken and Grain alone together (chicken eats grain)
- Fox and Grain can be left alone together (they're safe)

Let me find the solution:

Step 1: Farmer takes Chicken across
- Left bank: Fox, Grain
- Right bank: Farmer, Chicken
(Fox and Grain ar...

--- Final Answer ---
# Solution: 7 Steps (Minimum)



---

# Category 3: Iterative Refinement

Iterative refinement techniques use multiple LLM calls to progressively improve output quality.

<div class="alert alert-block alert-danger">
<b>Important:</b> These techniques represent <b>extreme optimization scenarios</b> and are not widely adopted due to significantly increased cost and latency (3-4x API calls). Apply them only in <b>high-stakes scenarios</b> where quality is paramount and the cost is justified.
</div>

| Technique | When to Use | API Calls | Cost Multiplier |
|-----------|-------------|----------|-----------------|
| **Self-Refine** | Creative writing, code generation | 3x | ~3x |
| **Chain-of-Verification** | Factual accuracy, citations | 4x | ~4x |

## 3.1 Self-Refine (Generate -> Critique -> Refine)

Self-Refine is a three-step process that mimics human revision:

1. **Generate**: Create an initial response (can use smaller/cheaper model)
2. **Critique**: Analyze the response for weaknesses (use stronger model for better critique)
3. **Refine**: Improve based on the critique (can use smaller model)

<div class="alert alert-block alert-info">
<b>Cost Optimization:</b> Consider using a stronger model (e.g., Sonnet) for critique and a smaller model (e.g., Haiku) for generation and refinement. The critique step benefits most from model capability.
</div>
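
The three steps can be sketched as a small loop in which each step is an injected callable, so a different model can back each role. This is an illustrative sketch rather than the notebook's implementation: `self_refine` and the prompt wording are assumptions, and stub functions stand in for model calls so the control flow can be checked offline.

```python
def self_refine(task, generate, critique, refine, rounds=1):
    """Run generate -> critique -> refine; each step is a callable (str -> str)."""
    draft = generate(task)
    history = [draft]
    for _ in range(rounds):
        feedback = critique(f"Task: {task}\n\nDraft: {draft}\n\nList specific improvements:")
        draft = refine(
            f"Task: {task}\n\nDraft: {draft}\n\nFeedback: {feedback}\n\nRewrite the draft:"
        )
        history.append(draft)
    return draft, history


# Stub callables stand in for model calls (hypothetical outputs)
final, history = self_refine(
    "Describe headphones.",
    generate=lambda prompt: "Good headphones.",
    critique=lambda prompt: "Mention battery life.",
    refine=lambda prompt: "Good headphones with long battery life.",
    rounds=2,
)
print(len(history), final)
```

In this notebook, each callable could wrap `call_bedrock`, for example `lambda p: call_bedrock(p, model_id=MODEL_SONNET)["text"]` for the critique step. Each extra round adds two more API calls.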

In [4]:
# Demo: Self-Refine with different models for each step
print("=" * 70)
print("DEMO: Self-Refine (Strong Critique, Efficient Generate/Refine)")
print("=" * 70)

task = "Write a product description for a wireless noise-canceling headphone."

# Configuration: Use Haiku for generation/refinement, Sonnet for critique
GENERATE_MODEL = MODEL_HAIKU  # Cheaper for initial generation
CRITIQUE_MODEL = MODEL_SONNET  # Stronger for quality critique
REFINE_MODEL = MODEL_HAIKU  # Cheaper for refinement

# Step 1: Initial generation (cheaper model)
print("\n--- Step 1: Initial Generation (Haiku) ---")
initial_prompt = f"{task}\n\nWrite a concise 2-3 sentence product description."
initial_response = call_bedrock(initial_prompt, model_id=GENERATE_MODEL)
initial_text = initial_response["text"]
print(initial_text)

# Step 2: Critique (stronger model for better analysis)
print("\n--- Step 2: Critique (Sonnet) ---")
critique_prompt = f"""Review this product description and identify specific improvements:

Description: {initial_text}

Evaluate on: clarity, persuasiveness, key features, conciseness.
List 2-3 specific improvements needed:"""

critique_response = call_bedrock(critique_prompt, model_id=CRITIQUE_MODEL)
critique_text = critique_response["text"]
print(critique_text)

# Step 3: Refine (cheaper model can follow clear critique)
print("\n--- Step 3: Refined Output (Haiku) ---")
refine_prompt = f"""Rewrite this product description incorporating the feedback:

Original: {initial_text}

Feedback: {critique_text}

Write an improved 2-3 sentence description:"""

refined_response = call_bedrock(refine_prompt, model_id=REFINE_MODEL)
print(refined_response["text"])

# Cost comparison
print("\n" + "=" * 70)
print("Cost Analysis:")
print(f"  Generate (Haiku):  {initial_response['input_tokens'] + initial_response['output_tokens']} tokens")
print(f"  Critique (Sonnet): {critique_response['input_tokens'] + critique_response['output_tokens']} tokens")
print(f"  Refine (Haiku):    {refined_response['input_tokens'] + refined_response['output_tokens']} tokens")
print("\nUsing Sonnet only for critique optimizes cost while maintaining quality.")
print("=" * 70)

DEMO: Self-Refine (Strong Critique, Efficient Generate/Refine)

--- Step 1: Initial Generation (Haiku) ---
# SoundShield Pro Wireless Headphones

Experience crystal-clear audio in any environment with advanced active noise cancellation that blocks up to 99% of ambient sound, letting you focus on what matters most. These premium wireless headphones deliver 30+ hours of battery life, seamless Bluetooth connectivity, and premium comfort for all-day wear. Perfect for travel, work, or everyday listening—immerse yourself in pure sound, distraction-free.

--- Step 2: Critique (Sonnet) ---
# Review of SoundShield Pro Product Description

## Overall Assessment
The description is solid but has room for optimization in specificity and structure.

## Evaluation by Criteria

**Clarity:** Good - Easy to understand, though some claims lack context
**Persuasiveness:** Moderate - Makes bold claims but needs more credibility
**Key Features:** Present but underdeveloped - Missing technical details
**Conc

## 3.2 Chain-of-Verification (CoVe)

Chain-of-Verification reduces hallucinations by generating verification questions and cross-checking claims.

**The CoVe Process:**
1. Generate initial response
2. Create verification questions for key claims
3. Answer verification questions independently
4. Produce final verified response

<div class="alert alert-block alert-warning">
<b>Use Case:</b> Only for high-stakes factual content where accuracy is critical (medical, legal, financial). The 4x cost is rarely justified for general use cases.
</div>
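
Step 3 calls for answering the verification questions independently. One way to do that is to split the generated list into individual questions and send each in its own call. The sketch below covers only the parsing; it assumes the model returns a numbered list, and both `parse_numbered_questions` and the sample text are hypothetical.

```python
import re


def parse_numbered_questions(text):
    """Extract items from a numbered list such as '1. Is X true?'."""
    questions = []
    for line in text.splitlines():
        match = re.match(r"^\s*\d+[.)]\s+(.*\S)\s*$", line)
        if match:
            # Strip Markdown bold markers the model may wrap around a question
            questions.append(match.group(1).replace("**", ""))
    return questions


# Hypothetical model output, for illustration only
sample = """# 3 Verification Questions

1. Does Amazon S3 offer multiple storage classes?
2. **Can lifecycle policies move objects between storage classes?**
3) Is versioning available on S3 buckets?
"""

questions = parse_numbered_questions(sample)
for q in questions:
    print(q)
```

Answering each question separately raises the call count beyond four, so it only makes sense where the independence of the checks is worth the extra cost.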

In [5]:
# Demo: Chain-of-Verification
print("=" * 70)
print("DEMO: Chain-of-Verification (CoVe)")
print("=" * 70)

question = "What are 3 key features of Amazon S3?"

# Step 1: Initial response
print("\n--- Step 1: Initial Response ---")
initial = call_bedrock(question)
initial_text = initial["text"]
print(initial_text)

# Step 2: Generate verification questions
print("\n--- Step 2: Verification Questions ---")
verify_q_prompt = f"""Given this response, generate 3 verification questions to check accuracy:

Response: {initial_text}

List 3 verification questions:"""

verify_q_response = call_bedrock(verify_q_prompt)
print(verify_q_response["text"])

# Step 3: Answer verification questions
print("\n--- Step 3: Verification Answers ---")
verification_prompt = f"""Answer these questions about Amazon S3:

{verify_q_response["text"]}

Provide brief factual answers:"""

verification_response = call_bedrock(verification_prompt)
print(verification_response["text"])

# Step 4: Final verified response
print("\n--- Step 4: Final Verified Response ---")
final_prompt = f"""Based on the verification, provide a corrected final answer.

Original: {initial_text}
Verification: {verification_response["text"]}

Final verified answer about 3 key features of Amazon S3:"""

final_response = call_bedrock(final_prompt)
print(final_response["text"])

print("\n" + "=" * 70)
print("Note: 4 API calls total - use only when accuracy is critical.")
print("=" * 70)

DEMO: Chain-of-Verification (CoVe)

--- Step 1: Initial Response ---
# 3 Key Features of Amazon S3

1. **Scalability & Durability**
   - Automatically scales to handle any amount of data
   - Provides 99.999999999% (11 nines) durability by replicating data across multiple facilities
   - No need to provision storage capacity upfront

2. **Security & Access Control**
   - Encryption options (in-transit and at-rest)
   - Fine-grained access control through IAM policies, bucket policies, and ACLs
   - Versioning and MFA delete protection available

3. **Cost-Effective & Flexible**
   - Pay only for what you use (no upfront costs)
   - Multiple storage classes (Standard, Intelligent-Tiering, Glacier, etc.) for different use cases
   - Lifecycle policies to automatically transition data to cheaper storage tiers

These features make S3 ideal for backup, archiving, content distribution, and data lakes.

--- Step 2: Verification Questions ---
# 3 Verification Questions

1. **Durability Claim V

---

# Category 4: Efficiency Techniques

Efficiency techniques maintain reasoning quality while reducing token usage and latency, which makes them essential for production systems.

## 4.1 Chain-of-Draft (Concise Reasoning)

Chain-of-Draft produces minimal reasoning traces while maintaining accuracy. It preserves step-by-step thinking while dramatically reducing output tokens.

**Key instruction patterns:**
- "Keep each step to ~5 words max"
- "Use abbreviations"
- "Format: Step -> Result"

<div class="alert alert-block alert-success">
<b>Production Tip:</b> Chain-of-Draft can reduce output tokens by 50-70% while maintaining similar accuracy to full CoT.
</div>
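
The instruction patterns above can be wrapped in a small prompt builder, and the savings measured with a one-line formula. Both helpers below are illustrative sketches with hypothetical names, and the token counts in the example are made up; measure the reduction on your own tasks.

```python
def build_cod_prompt(problem, max_words=5):
    """Append Chain-of-Draft instructions (short steps, fixed format) to a problem."""
    return (
        f"{problem}\n\n"
        f"Think step by step, but keep each step to ~{max_words} words max.\n"
        "Format: Step -> Result\n"
        "Then final answer."
    )


def token_reduction_pct(full_tokens, draft_tokens):
    """Percentage of output tokens saved relative to the full reasoning trace."""
    if full_tokens <= 0:
        raise ValueError("full_tokens must be positive")
    return (1 - draft_tokens / full_tokens) * 100


print(build_cod_prompt("What is 15% of 80?"))
# Hypothetical token counts, for illustration only
print(f"Reduction: {token_reduction_pct(240, 96):.1f}%")
```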

In [6]:
# Demo: Chain-of-Draft vs Full Chain-of-Thought
print("=" * 70)
print("DEMO: Chain-of-Draft (Efficient Reasoning)")
print("=" * 70)

problem = """A company has 120 employees. 40% work in engineering, 25% in sales, 
and the rest in operations. How many employees are in operations?"""

# Full Chain-of-Thought
full_cot_prompt = f"""{problem}

Think step by step, explaining your reasoning in detail:"""

# Chain-of-Draft
cod_prompt = f"""{problem}

Think step by step, but keep each step to ~5 words max.
Format: Step -> Result
Then final answer."""

print("\n--- Full Chain-of-Thought ---")
full_response = call_bedrock(full_cot_prompt)
print(full_response["text"])
print(f"\nTokens: {full_response['output_tokens']} | Latency: {full_response['latency_ms']}ms")

print("\n--- Chain-of-Draft ---")
cod_response = call_bedrock(cod_prompt)
print(cod_response["text"])
print(f"\nTokens: {cod_response['output_tokens']} | Latency: {cod_response['latency_ms']}ms")

print("\n" + "=" * 70)
token_reduction = (1 - cod_response["output_tokens"] / full_response["output_tokens"]) * 100
print(f"Token reduction: {token_reduction:.1f}%")
print("Correct answer: 42 employees (120 - 48 - 30 = 42)")
print("=" * 70)

DEMO: Chain-of-Draft (Efficient Reasoning)

--- Full Chain-of-Thought ---
# Finding the Number of Operations Employees

Let me work through this step by step.

## Step 1: Identify what we know
- Total employees: 120
- Engineering: 40%
- Sales: 25%
- Operations: the rest (unknown percentage)

## Step 2: Find the percentage in operations
Since all employees must be accounted for:
- Operations % = 100% - 40% - 25% = **35%**

## Step 3: Calculate the number of operations employees
- Operations employees = 35% of 120
- Operations employees = 0.35 × 120
- Operations employees = **42**

## Verification
Let me check this makes sense:
- Engineering: 40% × 120 = 48 employees
- Sales: 25% × 120 = 30 employees
- Operations: 35% × 120 = 42 employees
- Total: 48 + 30 + 42 = 120 ✓

**Answer: 42 employees work in operations**

Tokens: 243 | Latency: 3014ms

--- Chain-of-Draft ---
Step -> Result
Total employees: 120
Engineering percentage: 40%
Engineering employees: 120 × 0.40 = 48
Sales percentage: 25

---

# Category 5: Diversity Techniques

Diversity techniques generate multiple varied outputs efficiently. They are valuable for creative tasks, brainstorming, and exploring solution spaces.

## 5.1 Verbalized Sampling

### The Problem: Temperature Alone Doesn't Guarantee Diversity

A common assumption is that calling the same prompt multiple times with high temperature will produce diverse outputs. In practice, models often converge on similar patterns despite temperature settings.

In [7]:
# Demo Part 1: The Problem - Temperature doesn't guarantee diversity
print("=" * 70)
print("PROBLEM: Multiple Calls with Temperature Still Converge")
print("=" * 70)

task = "Write a short story about a bear."
simple_prompt = f"{task}"

print(f"\nTask: {task}")
print("Calling 5 times with temperature=0.9...\n")

for i in range(5):
    response = call_bedrock(simple_prompt, temperature=0.9, max_tokens=100)
    print(f"{i + 1}. {response['text'][:150]}..." if len(response["text"]) > 150 else f"{i + 1}. {response['text']}")
    print()

PROBLEM: Multiple Calls with Temperature Still Converge

Task: Write a short story about a bear.
Calling 5 times with temperature=0.9...

1. # The Bear's Gift

The old bear had lived in the mountain for forty winters. Her fur was silver-tipped now, and her left ear had a notch from a fight ...

2. # The Bear's Gift

On a cold autumn morning, a black bear named Milo wandered through the forest, searching for berries before winter. His stomach rum...

3. # The Bear's Gift

On a cold autumn morning, a young bear named Copper stood at the edge of Willow Creek, watching salmon leap upstream. His mother ha...

4. # The Bear's Gift

The old bear shuffled through the autumn forest, his dark fur matted and graying at the edges. His name was Kodiak, though no one h...

5. # The Bear's Gift

The old bear had seen many winters. His fur, once jet black, now carried streaks of silver along his shoulders. He moved through th...



Even with high temperature, outputs often follow similar narrative patterns.

### The Solution: Verbalized Sampling

Verbalized Sampling explicitly requests multiple distinct variations in a single call by specifying **probability distributions** and **diversity requirements** directly in the prompt.

**Key insight**: Instead of relying on randomness (temperature), we verbally instruct the model to sample from different parts of the distribution.

In [8]:
# Demo Part 2: The Solution - Verbalized Sampling
print("=" * 70)
print("SOLUTION: Verbalized Sampling (Explicit Diversity)")
print("=" * 70)

task = "Write a short story about a bear."

verbalized_system = """You are a helpful assistant. For each query, please generate a set of five possible responses, 
each within a separate <response> tag. Responses should each include a <text> and a numeric <probability>. 
Please sample from the tails of the distribution, such that each response explores a different creative direction."""

verbalized_prompt = f"{task} Write just 2-3 sentences per response."

print("\nUsing verbalized sampling (single call, temp=0.3)...\n")

verbalized_response = call_bedrock(verbalized_prompt, system=verbalized_system, temperature=0.3, max_tokens=800)
print(verbalized_response["text"])

SOLUTION: Verbalized Sampling (Explicit Diversity)

Using verbalized sampling (single call, temp=0.3)...

<response>
<text>
A grizzly bear named Mabel discovered an abandoned cabin in the woods and decided to make it her winter den, decorating it with wildflowers she'd collected all summer. Every morning, she'd sit on the porch watching the sunrise, wondering about the humans who used to live there. By spring, she'd filled three journals with sketches of the forest, never knowing she'd become an artist.
</text>
<probability>0.18</probability>
</response>

<response>
<text>
In a distant future, a sentient polar bear named Kess worked as a climate scientist, studying the ice sheets her ancestors once roamed. She transmitted her findings across the galaxy, hoping other species would learn from Earth's mistakes before it was too late. Her greatest discovery wasn't about the ice—it was that hope itself could be measured and shared.
</text>
<probability>0.17</probability>
</response>

<respo
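
Because the candidates arrive as tagged text, they need parsing before use. The sketch below assumes the `<response>` / `<text>` / `<probability>` layout requested in the system prompt; the function name and sample text are illustrative. Incomplete blocks, such as one cut off by `max_tokens`, are skipped.

```python
import re

RESPONSE_PATTERN = re.compile(
    r"<response>\s*<text>(.*?)</text>\s*"
    r"<probability>\s*(\d*\.?\d+)\s*</probability>\s*</response>",
    re.DOTALL,
)


def parse_verbalized_responses(raw):
    """Return (text, probability) pairs for every complete <response> block."""
    return [(text.strip(), float(prob)) for text, prob in RESPONSE_PATTERN.findall(raw)]


# Hypothetical model output; the last block is cut off on purpose
sample = """<response>
<text>
A bear opens a bakery.
</text>
<probability>0.18</probability>
</response>

<response>
<text>A bear pilots a starship.</text>
<probability>0.17</probability>
</response>

<respo"""

for text, prob in parse_verbalized_responses(sample):
    print(f"{prob:.2f}  {text}")
```

The probabilities are the model's own verbalized estimates, so they are better treated as a rough ordering signal than as calibrated values.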

---

# Prompt Optimization Workflows

As prompts move from prototype to production, they often underperform on edge cases, produce inconsistent outputs, or fail to meet quality requirements. Ad-hoc tweaking leads to unpredictable results and wasted effort.

**Structured optimization workflows** provide systematic approaches to improve prompts:
- Identify what's not working
- Generate candidate improvements
- Evaluate objectively
- Select the best variant

| Typical Workflow | Best For | Effort | Key Approach |
|----------|----------|--------|-------------|
| **Manual Iteration** | Domain expertise, full control | High | Human expert iteration |
| **LLM-Assisted** | Quick exploration, inspiration | Low | Use LLM to suggest variations |
| **Automated Selection** | Objective ranking with data | Medium | Evaluate against test dataset |
| **Framework-Driven** | Complex multi-step pipelines | High | DSPy or similar frameworks |

## Workflow 1: Manual Iteration

**When to use**:
- You have deep domain expertise
- Need full control over prompt evolution
- Small-scale, high-value applications

**Process**: Start simple, identify issues, progressively add structure and constraints.

In [9]:
# Demo: Manual prompt iteration
print("=" * 70)
print("WORKFLOW 1: Manual Prompt Iteration")
print("=" * 70)

test_input = "Customer says: I've been waiting 3 weeks for my order and no one is helping me!"

versions = [
    ("V1: Basic", f"Respond to this complaint:\n{test_input}"),
    (
        "V2: + Persona",
        f"""You are a helpful customer service agent.

Respond to this complaint:
{test_input}""",
    ),
    (
        "V3: + Structure",
        f"""You are a customer service agent for CloudShop.

Respond to: {test_input}

Requirements:
- Start with empathy
- Acknowledge the specific issue
- Provide concrete next steps
- Keep under 100 words""",
    ),
]

for name, prompt in versions:
    print(f"\n--- {name} ---")
    response = call_bedrock(prompt, max_tokens=200)
    print(response["text"])
    print(f"[Tokens: {response['output_tokens']}]")

WORKFLOW 1: Manual Prompt Iteration

--- V1: Basic ---
# Response to Customer Complaint

Thank you for reaching out, and I sincerely apologize for the frustration you're experiencing with your order delay.

A 3-week wait is unacceptable, and I understand your concern. Here's how I'd like to help:

**Immediate next steps:**
1. **Order details** – Could you please provide your order number so I can investigate the status right away?
2. **What I'll do** – I'll personally track down where your order is and identify what's causing the delay
3. **Timeline** – I'll get back to you within [24 hours] with a specific update and resolution

**In the meantime:**
- If your order has been lost or significantly delayed beyond our standard timeframe, we'll make this right—whether that's expedited shipping, a replacement, or a refund
- You'll have direct contact with me to avoid the runa
[Tokens: 200]

--- V2: + Persona ---
# Thank you for reaching out, and I sincerely apologize for the frustration you

## Workflow 2: LLM-Assisted Exploration

**When to use**:
- Quick exploration of prompt space
- Need inspiration for variations
- Limited time for manual iteration

**Process**: Use the LLM itself to generate prompt variations and improvements.

In [10]:
# Demo: LLM-assisted prompt exploration
print("=" * 70)
print("WORKFLOW 2: LLM-Assisted Prompt Exploration")
print("=" * 70)

base_prompt = """Classify sentiment as POSITIVE, NEGATIVE, or NEUTRAL.

Review: {review}
Sentiment:"""

meta_prompt = f"""You are a prompt engineering expert. Suggest 3 improvements for this prompt:

```
{base_prompt}
```

For each: explain why it helps, show the modified prompt.
"""

print("\n--- LLM-Generated Variations ---")
response = call_bedrock(meta_prompt, max_tokens=1500, model_id=MODEL_SONNET)
print(response["text"])

WORKFLOW 2: LLM-Assisted Prompt Exploration

--- LLM-Generated Variations ---
# 3 Improvements for Your Sentiment Classification Prompt

## Improvement 1: Add Examples (Few-Shot Learning)

**Why it helps:** Providing examples clarifies your expectations, reduces ambiguity about edge cases, and significantly improves accuracy. The model learns your specific classification criteria.

**Modified prompt:**
```
Classify sentiment as POSITIVE, NEGATIVE, or NEUTRAL.

Examples:
Review: "This product exceeded my expectations! Love it."
Sentiment: POSITIVE

Review: "Terrible quality. Broke after one day."
Sentiment: NEGATIVE

Review: "It arrived on time. Standard packaging."
Sentiment: NEUTRAL

Review: {review}
Sentiment:
```

## Improvement 2: Add Clear Definitions and Edge Case Handling

**Why it helps:** Defines boundaries between categories, especially for mixed sentiments or neutral statements. This reduces inconsistent classifications and helps with nuanced reviews.

**Modified prompt:**
`

### Variant: Bedrock OptimizePrompt API

Amazon Bedrock provides a native **OptimizePrompt** API that automatically rewrites prompts for better performance with specific models. This is a variant of LLM-assisted exploration.

**How it works:**
1. Submit your prompt and target model
2. Bedrock analyzes the prompt structure
3. Returns an optimized version following model-specific best practices

In [11]:
# Demo: Bedrock OptimizePrompt API
print("=" * 70)
print("WORKFLOW 2 VARIANT: Bedrock OptimizePrompt API")
print("=" * 70)

original_prompt = """Classify customer support tickets into categories. 
The categories are: billing, technical, account, and general.
Ticket: {ticket}
Category:"""

TARGET_MODEL = "anthropic.claude-sonnet-4-5-20250929-v1:0"

print("\n--- Original Prompt ---")
print(original_prompt)


def optimize_prompt(prompt_text, target_model_id):
    """Read the OptimizePrompt event stream and collect the analysis message and rewritten prompt."""
    response = bedrock_agent_runtime.optimize_prompt(
        input={"textPrompt": {"text": prompt_text}}, targetModelId=target_model_id
    )

    result = {"analysis": None, "optimized_prompt": None}

    for event in response["optimizedPrompt"]:
        if "analyzePromptEvent" in event:
            result["analysis"] = event["analyzePromptEvent"].get("message", "")
        elif "optimizedPromptEvent" in event:
            optimized = event["optimizedPromptEvent"].get("optimizedPrompt", {})
            if "textPrompt" in optimized:
                result["optimized_prompt"] = optimized["textPrompt"].get("text", "")

    return result


print(f"\n--- Optimizing for: {TARGET_MODEL} ---")
result = optimize_prompt(original_prompt, TARGET_MODEL)

if result["analysis"]:
    print("\n--- Analysis ---")
    print(result["analysis"])
if result["optimized_prompt"]:
    print("\n--- Optimized Prompt ---")
    print(result["optimized_prompt"])

print("\n" + "=" * 70)

WORKFLOW 2 VARIANT: Bedrock OptimizePrompt API

--- Original Prompt ---
Classify customer support tickets into categories. 
The categories are: billing, technical, account, and general.
Ticket: {ticket}
Category:

--- Optimizing for: anthropic.claude-sonnet-4-5-20250929-v1:0 ---

--- Analysis ---
Analysis of your prompt is complete

--- Optimized Prompt ---
"<instruction>\nYou are a customer support ticket classifier. Your task is to analyze the provided support ticket and assign it to exactly ONE of the following categories:\n\n<categories>\n<category name=\"billing\">Issues related to payments, invoices, charges, refunds, pricing, or subscription costs</category>\n<category name=\"technical\">Problems with product functionality, bugs, errors, performance issues, or technical difficulties</category>\n<category name=\"account\">Matters concerning user accounts, login issues, password resets, profile settings, or account access</category>\n<category name=\"general\">General inquiries, f

## Workflow 3: Automated Selection with Evaluation

**When to use**:
- Test dataset available (20+ examples)
- Need objective ranking of prompt variants
- Scaling prompt optimization

**Process**: Test prompt candidates against a labeled evaluation dataset and select the best performer.

**Why Invest in Evals?** Teams without structured evaluations often get stuck in reactive loops: they fix one failure while creating another and cannot distinguish real regressions from noise. Start small with 20-50 realistic tasks drawn from actual user failures rather than waiting for a perfect comprehensive suite. Convert the manual checks you already perform during development into structured test cases, and let metrics replace guesswork. For a deeper dive into building effective evaluation datasets, see [Demystifying Evals for AI Agents](https://www.anthropic.com/engineering/demystifying-evals-for-ai-agents).
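
One way to convert manual checks into structured test cases is to write each check as a small record and validate the set before relying on it. The sketch below assumes the record shape that the evaluation cells in this section read (`message` and `expected_intent`). The `validate_cases` helper, the `ALLOWED_INTENTS` set, and the example records are hypothetical.

```python
import json

# Hypothetical manual checks written down as records. The two field names are
# the ones this notebook's evaluation cells read from the eval file.
manual_checks = [
    {"message": "I want to cancel my subscription immediately", "expected_intent": "CANCELLATION"},
    {"message": "How do I change my password?", "expected_intent": "ACCOUNT_HELP"},
]

# Assumed label set for illustration; use your own taxonomy here.
ALLOWED_INTENTS = {"CANCELLATION", "ACCOUNT_HELP", "ORDER_ISSUE"}


def validate_cases(cases, allowed_intents):
    """Return a list of human-readable problems found in the test cases."""
    problems = []
    seen = set()
    for i, case in enumerate(cases):
        message = case.get("message", "").strip()
        intent = case.get("expected_intent")
        if not message:
            problems.append(f"case {i}: empty message")
        elif message.lower() in seen:
            problems.append(f"case {i}: duplicate message")
        if intent not in allowed_intents:
            problems.append(f"case {i}: unknown intent {intent!r}")
        seen.add(message.lower())
    return problems


print(validate_cases(manual_checks, ALLOWED_INTENTS) or "all cases valid")
print(json.dumps(manual_checks[0]))
```

Catching duplicates and unknown labels early matters because both distort accuracy numbers on a dataset this small.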

<div class="alert alert-block alert-info">
<b>Amazon Bedrock Evaluation:</b> For production use cases, consider using <a href="https://docs.aws.amazon.com/bedrock/latest/userguide/evaluation.html">Amazon Bedrock Model Evaluation</a>, which provides:
<ul>
<li><b>Automatic evaluation</b>: Built-in metrics for accuracy, robustness, toxicity</li>
<li><b>LLM-as-a-judge</b>: Use a model to evaluate another model's outputs</li>
<li><b>Human evaluation</b>: Integrate human reviewers for subjective quality</li>
</ul>
</div>

In [12]:
# Load evaluation dataset
with open("data/intent_classification_eval.json") as f:
    eval_dataset = json.load(f)

print(f"Loaded {len(eval_dataset)} evaluation examples")
print("\nSample entries:")
for item in eval_dataset[:3]:
    print(f"  - {item['message'][:50]}... -> {item['expected_intent']}")

intent_counts = Counter(item["expected_intent"] for item in eval_dataset)
print(f"\nIntent distribution: {len(intent_counts)} categories")

Loaded 40 evaluation examples

Sample entries:
  - I want to cancel my subscription immediately... -> CANCELLATION
  - How do I change my password?... -> ACCOUNT_HELP
  - My order hasn't arrived after 2 weeks, this is una... -> ORDER_ISSUE

Intent distribution: 13 categories


In [None]:
# Run automated evaluation
print("=" * 70)
print("WORKFLOW 3: AUTOMATED PROMPT EVALUATION")
print("=" * 70)

INTENTS = sorted({item["expected_intent"] for item in eval_dataset})

# Define prompt candidates - minimal vs structured
prompts = {
    # Minimal prompt - no category list, model must guess
    "minimal": """Classify this customer message into a support category.

Message: "{message}"

Category:""",
    # With category list - model knows valid options
    "with_categories": f"""Classify this customer message into one of these categories:
{", ".join(INTENTS)}

Message: "{{message}}"

Category:""",
}


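def evaluate_prompt(template, dataset, sample_size=20):
    """Score a prompt template on the first `sample_size` items of `dataset`."""
    results = []
    # Check longer labels first so a label that contains another label is not shadowed
    intents_by_length = sorted(INTENTS, key=len, reverse=True)

    for item in dataset[:sample_size]:
        prompt = template.format(message=item["message"])
        response = call_bedrock(prompt, max_tokens=30, temperature=0)

        # Take the first known intent that appears anywhere in the model's reply
        pred_text = response["text"].strip().upper()
        predicted = next((intent for intent in intents_by_length if intent in pred_text), None)

        is_correct = predicted == item["expected_intent"]
        results.append({"expected": item["expected_intent"], "predicted": predicted, "correct": is_correct})

    # Divide by the number of items actually scored, which can be smaller than sample_size
    correct = sum(r["correct"] for r in results)
    return {"accuracy": correct / len(results) if results else 0.0, "results": results}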


EVAL_SIZE = 20
print(f"\nEvaluating {len(prompts)} prompt variants on {EVAL_SIZE} examples...\n")

scores = {}
for name, template in prompts.items():
    print(f"Evaluating '{name}'...", end=" ")
    result = evaluate_prompt(template, eval_dataset, sample_size=EVAL_SIZE)
    scores[name] = result
    print(f"{result['accuracy'] * 100:.1f}%")

best = max(scores, key=lambda x: scores[x]["accuracy"])
print(f"\n✓ Best performer: '{best}' with {scores[best]['accuracy'] * 100:.1f}% accuracy")

# Show some errors for analysis
print("\n--- Sample Errors (for analysis) ---")
for name, result in scores.items():
    errors = [r for r in result["results"] if not r["correct"]][:2]
    if errors:
        print(f"\n{name}:")
        for e in errors:
            print(f"  Expected: {e['expected']}, Got: {e['predicted']}")
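
With only 20 examples per variant, a gap of a few percentage points may be noise rather than a real difference between prompts. As a rough check, the sketch below computes an approximate 95% Wilson score interval for an accuracy value. The `wilson_interval` helper and the example counts are hypothetical and are not results from this notebook.

```python
from __future__ import annotations

import math


def wilson_interval(correct: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% Wilson score interval for an accuracy of correct/total."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return max(0.0, center - half), min(1.0, center + half)


# Hypothetical counts for two prompt variants scored on 20 examples each
for name, correct in {"prompt_a": 15, "prompt_b": 18}.items():
    low, high = wilson_interval(correct, 20)
    print(f"{name}: {correct}/20 correct, interval {low:.2f} to {high:.2f}")
```

When the intervals of two variants overlap substantially, it is safer to enlarge the evaluation set before declaring a winner. Because both prompts are scored on the same examples, a paired comparison would be more sensitive than this independent-interval check.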

## Workflow 4: Framework-Driven (DSPy)

**Reference**: [Stanford NLP DSPy Framework](https://github.com/stanfordnlp/dspy)

**When to use**:
- Complex multi-step workflows
- Need automatic prompt optimization
- Research or advanced applications

DSPy shifts from "prompt engineering" to "prompt programming": you define **what** you want, not **how** to prompt.

| Traditional Approach | DSPy Approach |
|---------------------|---------------|
| Manual prompt writing | Framework generates prompts |
| Trial and error tuning | Automatic optimization |
| "Classify into categories: BILLING..." | `dspy.Predict("message -> intent")` |

**Key Concepts:**

1. **Signatures** - Define input/output contract (not the prompt itself)
   ```python
   class QA(dspy.Signature):
       question: str = dspy.InputField()
       answer: str = dspy.OutputField()
   ```

2. **Modules** - Reasoning patterns applied automatically
   ```python
   qa = dspy.ChainOfThought(QA)  # DSPy adds CoT prompting
   ```

3. **Optimizers** - Learn better prompts from training examples
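   ```python
   from dspy.teleprompt import BootstrapFewShot
   # `accuracy` is your own metric function; `examples` is a list of dspy.Example
   optimizer = BootstrapFewShot(metric=accuracy)
   optimized_qa = optimizer.compile(qa, trainset=examples)
   ```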

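**What optimizers can do automatically (varies by optimizer):**
- Select effective few-shot examples from your data (the focus of `BootstrapFewShot`)
- Tune instruction phrasing (instruction optimizers such as `MIPROv2`)
- Find an effective prompt structure for your task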
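
To make the first bullet concrete, here is a toy, library-free illustration of metric-driven example selection. It is not DSPy's algorithm. It exhaustively scores every small subset of candidate demonstrations on a dev set and keeps the best one. The `overlap_predict` function is a hypothetical stand-in for a model call, and all data is made up for illustration.

```python
from itertools import combinations


def overlap_predict(message, demos):
    """Stand-in for an LLM: return the label of the demo sharing the most words."""
    words = set(message.lower().split())
    best = max(demos, key=lambda d: len(words & set(d["message"].lower().split())))
    return best["intent"]


def accuracy(demos, devset):
    """Fraction of dev examples labeled correctly when using these demos."""
    hits = sum(overlap_predict(ex["message"], demos) == ex["intent"] for ex in devset)
    return hits / len(devset)


def select_demos(pool, devset, k=2):
    """Score every k-demo subset of the pool and return the best-scoring one."""
    return max(combinations(pool, k), key=lambda demos: accuracy(demos, devset))


pool = [
    {"message": "refund my last payment", "intent": "billing"},
    {"message": "hello there", "intent": "general"},
    {"message": "app crashes on login", "intent": "technical"},
    {"message": "thanks", "intent": "general"},
]
devset = [
    {"message": "I need a refund for a payment", "intent": "billing"},
    {"message": "the app crashes every time", "intent": "technical"},
]

best = select_demos(pool, devset)
print([d["message"] for d in best], accuracy(best, devset))
```

Exhaustive search only works for tiny pools. Frameworks exist largely because realistic search spaces need smarter strategies and real model calls.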

<div class="alert alert-block alert-info">
<b>Note:</b> DSPy requires significant setup and is best suited for research or production systems with complex multi-step pipelines. For most use cases, manual iteration or automated evaluation workflows are sufficient.
</div>

---

# Key Principles

Before applying prompt engineering techniques, keep these principles in mind:

## Principle 1: Techniques are NOT Progressive

**Common misconception**: "Start with basic prompting, then add CoT, then add self-consistency..."

**Reality**: Select a technique based on the problem and its requirements, not a fixed sequence.

## Principle 2: Mix Techniques Freely

Techniques can be combined:
- **CoT + Extended Thinking** = Deep reasoning with structured output
- **Few-shot + Self-Refine** = Quality-controlled generation
- **CoD + Verbalized Sampling** = Fast diverse outputs
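
As one illustration of combining techniques, the sketch below pairs few-shot drafting with a Self-Refine loop (draft, critique, revise). The model is injected as a plain callable so the control flow can be tested without any API. The `few_shot_self_refine` name, the prompt wording, and the stopping rule are assumptions made for this example.

```python
from __future__ import annotations

from typing import Callable


def few_shot_self_refine(
    task: str,
    examples: list[tuple[str, str]],
    call_model: Callable[[str], str],
    max_rounds: int = 2,
) -> str:
    """Draft with few-shot examples, then critique and revise up to max_rounds times."""
    shots = "\n\n".join(f"Input: {x}\nOutput: {y}" for x, y in examples)
    draft = call_model(f"{shots}\n\nInput: {task}\nOutput:")
    for _ in range(max_rounds):
        critique = call_model(
            f"Task: {task}\nDraft: {draft}\n"
            "List concrete problems with the draft, or reply exactly OK if there are none."
        )
        if critique.strip().upper() == "OK":
            break  # the critic found nothing to fix, so stop early
        draft = call_model(
            f"Task: {task}\nDraft: {draft}\nCritique: {critique}\n"
            "Rewrite the draft so that it addresses the critique."
        )
    return draft
```

In this notebook, `call_model` could wrap the existing helper, for example `lambda p: call_bedrock(p, max_tokens=500, temperature=0)["text"]`. The early exit keeps the extra cost of refinement limited to drafts that actually need it.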

## Principle 3: Evaluate Objectively

Always measure before and after applying techniques:
- Define success metrics upfront
- Use representative test cases
- Track cost alongside quality
- Document what works for future reference

## Principle 4: Consider Cost/Latency Trade-offs

| Technique | Quality Impact | Cost Impact | Latency Impact |
|-----------|---------------|-------------|----------------|
| CoT | High | Medium | Medium |
| Extended Thinking | Very High | High | High |
| Self-Refine | High | High | High |
| CoD | Medium | Low | Low |
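
The qualitative ratings above can be turned into a rough per-request estimate. The sketch below is back-of-the-envelope arithmetic: the `CallProfile` class, the token counts, and the prices are placeholders rather than real pricing, so substitute measured token usage and your model's current rates.

```python
from dataclasses import dataclass


@dataclass
class CallProfile:
    calls: int          # model invocations per request
    input_tokens: int   # average input tokens per call
    output_tokens: int  # average output tokens per call


def cost_per_request(profile, price_in_per_1k, price_out_per_1k):
    """Estimated cost of one request, in whatever currency unit the prices use."""
    per_call = (
        profile.input_tokens / 1000 * price_in_per_1k
        + profile.output_tokens / 1000 * price_out_per_1k
    )
    return profile.calls * per_call


# Placeholder numbers for illustration only
PRICE_IN, PRICE_OUT = 1.0, 5.0
baseline = CallProfile(calls=1, input_tokens=500, output_tokens=100)
self_refine = CallProfile(calls=3, input_tokens=700, output_tokens=300)

base_cost = cost_per_request(baseline, PRICE_IN, PRICE_OUT)
refine_cost = cost_per_request(self_refine, PRICE_IN, PRICE_OUT)
print(f"Self-Refine costs about {refine_cost / base_cost:.1f}x the baseline")
```

The number of calls is a lower bound on the cost multiplier, since critique and revision calls also carry the earlier draft in their input.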

## Principle 5: Start Simple, Add Complexity When Needed

**Recommended approach**:
1. Start with Foundation techniques
2. Test against requirements
3. Identify failure modes
4. Add targeted techniques to address specific failures
5. Avoid over-engineering

---

# Summary

### Techniques Overview

| Category | Technique | Best For | Trade-off |
|----------|-----------|----------|----------|
| **Reasoning** | Chain-of-Thought | Math, logic, multi-step | +tokens |
| **Reasoning** | Extended Thinking | Complex analysis, coding | +thinking budget |
| **Refinement** | Self-Refine | Quality-critical output | 3x calls |
| **Refinement** | Chain-of-Verification | Factual accuracy | 4x calls |
| **Efficiency** | Chain-of-Draft | Production efficiency | Minimal |
| **Diversity** | Verbalized Sampling | Creative diversity | None |

### Optimization Workflows

| Workflow | Best For | Key Approach |
|----------|----------|-------------|
| **Manual Iteration** | Domain expertise, full control | Human expert iteration |
| **LLM-Assisted** | Quick exploration | LLM suggests variations |
| **Automated Selection** | Objective ranking | Evaluate against test dataset |
| **Framework-Driven** | Complex pipelines | DSPy or similar |

### Decision Framework

<div class="alert alert-block alert-success">
<ol>
    <li><b>Start simple</b> - clear instructions and few-shot examples are often sufficient</li>
    <li><b>Add reasoning</b> - CoT for complex logic, extended thinking for deep analysis</li>
    <li><b>Add iteration</b> - only for high-stakes scenarios (3-4x cost)</li>
    <li><b>Evaluate systematically</b> - use automated selection when you have data</li>
</ol>
</div>

---

## Additional Resources

- [Amazon Bedrock Prompt Optimization](https://docs.aws.amazon.com/bedrock/latest/userguide/prompt-management-optimize.html)
- [Amazon Bedrock Model Evaluation](https://docs.aws.amazon.com/bedrock/latest/userguide/evaluation.html)
- [Extended Thinking Documentation](https://docs.aws.amazon.com/bedrock/latest/userguide/claude-messages-extended-thinking.html)
- [Prompt Engineering Guidelines](https://docs.aws.amazon.com/bedrock/latest/userguide/prompt-engineering-guidelines.html)
- [DSPy Framework](https://github.com/stanfordnlp/dspy)

## References

The prompt engineering concepts in this notebook are adapted from:

- [A Systematic Survey of Prompt Engineering in Large Language Models: Techniques and Applications](https://arxiv.org/pdf/2402.07927v2)
- [Verbalized Sampling: How to Mitigate Mode Collapse and Unlock LLM Diversity](https://arxiv.org/pdf/2510.01171)