# Lab 2: Prompt Chains & Testing Workflows — SOLUTION KEY

This document contains example solutions for all exercises. Your answers may differ — there are many valid approaches.

## Learning Objectives
- Understand and debug multi-step prompt chains
- Design prompt workflows using sequential, fan-out, and iterative patterns
- Critically evaluate biased A/B test designs
- Architect complex multi-step pipelines for real business tasks

**Duration:** 55–65 minutes | **Difficulty:** Intermediate

---

## Part 1: Understanding Prompt Chains

A **prompt chain** breaks a complex task into discrete steps where the output of one step becomes the input for the next.

### Example: Article Writing Chain

**Step 1 — Extract Topics:**
> Extract the 5 most important topics from the following subject: AI in healthcare

*Output:*
> 1. Current adoption rates and growth trajectory
> 2. Primary use cases in clinical settings
> 3. Regulatory landscape and compliance
> 4. Cost-benefit analysis for hospital systems
> 5. Patient outcome improvements backed by studies

**Step 2 — Create Outline** (uses Step 1 output):
> Create a detailed article outline organized around these topics: [Step 1 output]

**Step 3 — Write Introduction** (uses Step 2 output):
> Write a compelling 150-word introduction for an article with this outline: [Step 2 output]
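
The chain above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: `run_chain`, the `{previous}` placeholder, and `fake_model` are hypothetical names introduced here, and `fake_model` stands in for whatever AI tool you actually call.

```python
from typing import Callable, List

# The three prompts from the example; {previous} marks where the prior output goes.
TEMPLATES = [
    "Extract the 5 most important topics from the following subject: {previous}",
    "Create a detailed article outline organized around these topics: {previous}",
    "Write a compelling 150-word introduction for an article with this outline: {previous}",
]


def run_chain(subject: str, templates: List[str], call_model: Callable[[str], str]) -> List[str]:
    """Run the prompts in order, feeding each output into the next prompt."""
    outputs: List[str] = []
    previous = subject
    for template in templates:
        prompt = template.format(previous=previous)
        previous = call_model(prompt)
        outputs.append(previous)  # keep every intermediate result for inspection
    return outputs


def fake_model(prompt: str) -> str:
    """Hypothetical stand-in for a real model call; it only wraps the prompt."""
    return f"OUT({prompt})"


outputs = run_chain("AI in healthcare", TEMPLATES, fake_model)
for step, text in enumerate(outputs, start=1):
    print(f"Step {step}: {text[:60]}")
```

Returning every intermediate output, rather than only the last one, is what makes it possible to inspect and fix a step before its result cascades.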

### Why Chains Work
- Each step has a focused, manageable scope
- The AI tends to produce better output when given specific, narrow tasks
- You can inspect and fix intermediate results before they cascade
- You can reuse steps across different workflows

---

## Workflow Patterns

### 1. Sequential Pattern
Steps run one after another. Each step’s output feeds the next.
```
[Step 1: Research] → [Step 2: Outline] → [Step 3: Draft] → [Step 4: Polish]
```
**Best for:** Article writing, report generation, email drafting

### 2. Fan-Out Pattern
The same prompt is applied to multiple inputs independently.
```
[Input A] → [Same Prompt] → [Result A]
[Input B] → [Same Prompt] → [Result B]
```
**Best for:** Batch analysis, processing multiple documents, scoring candidates

### 3. Iterative Pattern
The same prompt is re-applied to progressively refine the output.
```
[Draft v1] → [Refine] → [Draft v2] → [Refine] → [Draft v3]
```
**Best for:** Editing, polishing, improving quality

### Combining Patterns
Real workflows often combine patterns. Example hiring pipeline:
1. **Fan-out:** Score each candidate on 3 dimensions
2. **Sequential:** Rank → Select finalists → Generate interview questions → Write memo
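
The fan-out and iterative patterns can be sketched as two small helpers. This is an illustrative sketch only: `fan_out`, `iterate`, the `{input}` and `{draft}` placeholders, and `fake_model` are hypothetical names, and `fake_model` stands in for a real model call.

```python
from typing import Callable, Dict, List


def fan_out(template: str, inputs: List[str], call_model: Callable[[str], str]) -> Dict[str, str]:
    """Fan-out: apply the same prompt to every input independently."""
    return {item: call_model(template.format(input=item)) for item in inputs}


def iterate(draft: str, template: str, rounds: int, call_model: Callable[[str], str]) -> List[str]:
    """Iterative: re-apply the same refine prompt, keeping each version."""
    versions = [draft]
    for _ in range(rounds):
        versions.append(call_model(template.format(draft=versions[-1])))
    return versions


def fake_model(prompt: str) -> str:
    """Hypothetical stand-in for a model call; reports the prompt length."""
    return f"[response to {len(prompt)}-char prompt]"


results = fan_out("Summarize: {input}", ["Input A", "Input B"], fake_model)
versions = iterate("Draft v1", "Improve this draft: {draft}", rounds=2, call_model=fake_model)
print(list(results))   # one result per input
print(len(versions))   # the original draft plus one version per round
```

A combined workflow would call `fan_out` once per candidate or document and then pass the collected results into a sequential chain.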

---

## Exercise 1: Debug the Broken Chain (10 minutes)

A 4-step prompt chain for generating a product launch email for **CloudSync Pro** has 3 bugs that cascade through the chain.

**Product:** CloudSync Pro — AI-powered cloud storage with smart file organization, 2TB capacity, real-time collaboration, end-to-end encryption, and intelligent search.

---

### The Broken Chain — Bugs Identified

**Step 1: Identify Features**

*Buggy Output:*
> Key Features of SmartFit Pro Fitness Tracker:
> 1. Heart rate monitoring...

**Bug:** The output describes the WRONG PRODUCT (a fitness tracker called SmartFit Pro instead of CloudSync Pro cloud storage). The step completely ignored the actual product in the input and hallucinated features for an unrelated product.

**Corrected prompt for Step 1:**
> List the 5 most compelling features of CloudSync Pro, an AI-powered cloud storage solution with smart file organization, 2TB capacity, real-time collaboration, end-to-end encryption, and intelligent search. Focus on what makes each feature valuable to business users.

**Corrected Step 1 output (from AI tool):**
> Key Features of CloudSync Pro:
> 1. AI-powered smart file organization — automatically categorizes and tags files, saving hours of manual sorting
> 2. 2TB storage capacity — enough for entire team libraries with room to grow
> 3. Real-time collaboration — multiple users can edit documents simultaneously with live sync
> 4. End-to-end encryption — enterprise-grade security ensures data stays protected at rest and in transit
> 5. Intelligent search — natural language queries find files instantly across all folders and file types

---

**Step 2: Prioritize Features**

*Buggy Output:*
> 1. [LOW] Water resistance — nice-to-have
> 2. [LOW] GPS mapping — niche appeal
> ...
> 5. [HIGH] AI-powered insights — unique selling point

**Bug:** The priority order is REVERSED — the lowest-importance features are listed first (positions 1-2) and the highest-importance features are listed last (positions 4-5). Since the email draft step will "lead with the top features," it will lead with the weakest ones.

**Corrected prompt for Step 2:**
> Rank the following features of CloudSync Pro by customer impact, with the HIGHEST importance first. For each feature, explain why it ranks where it does:
> [paste Step 1 output]

**Corrected Step 2 output:**
> Prioritized Features (highest impact first):
> 1. [HIGH] End-to-end encryption — #1 concern for enterprise buyers; non-negotiable for regulated industries
> 2. [HIGH] AI-powered smart file organization — unique differentiator; saves measurable time daily
> 3. [HIGH] Real-time collaboration — essential for remote/hybrid teams; directly competes with Google Drive
> 4. [MED] Intelligent search — strong value-add; reduces file-finding time from minutes to seconds
> 5. [MED] 2TB storage capacity — competitive baseline; expected but not a differentiator

---

**Step 3: Draft Email**

*Buggy Output:*
> Subject: Exciting News!
> Hi there,
> We have something new for you. Our product is great and you should buy it...

**Bug:** The draft completely ignores the prioritized features from Step 2. It produces generic marketing copy with zero specifics — no product name, no features, no differentiators. It could be about any product.

**Corrected prompt for Step 3:**
> Write a product launch announcement email for CloudSync Pro. Lead with the top 2 features from this prioritized list, mention all 5, and include specific benefits for each:
> [paste Step 2 output]
>
> Format: Compelling subject line, personal greeting, 3 paragraphs (lead feature, supporting features, invitation to try), professional sign-off.

**Corrected Step 3 output:**
> Subject: Introducing CloudSync Pro — Cloud Storage That Thinks for You
>
> Hi [Name],
>
> What if your cloud storage could organize itself? CloudSync Pro combines enterprise-grade end-to-end encryption with AI-powered file organization that automatically categorizes, tags, and sorts your documents — so you spend time working, not searching.
>
> But that’s just the start. CloudSync Pro also features real-time collaboration for seamless teamwork, intelligent natural-language search that finds any file in seconds, and a generous 2TB of storage for your entire team. Whether you’re in a regulated industry that demands bulletproof security or a fast-moving startup that needs instant access to everything, CloudSync Pro delivers.
>
> We’d love for you to experience the difference firsthand.
>
> Best regards,
> The CloudSync Team

**Step 4 (working correctly) then adds the CTA:**
> ---
> LIMITED TIME: Pre-order now at 25% off. Use code LAUNCH25 at checkout. Offer expires March 15th.
> >> Pre-order Now: https://example.com/launch <<
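
All three bugs could have been caught by cheap checks between steps. The sketch below assumes the outputs are plain strings and that priorities are tagged `[HIGH]`, `[MED]`, or `[LOW]` as in the examples above; the function names are hypothetical.

```python
import re


def mentions_product(output: str, product: str) -> bool:
    """True if the step output names the product it is supposed to be about."""
    return product.lower() in output.lower()


def priorities_descending(output: str) -> bool:
    """True if priority tags appear in HIGH, MED, LOW order from top to bottom."""
    rank = {"HIGH": 0, "MED": 1, "LOW": 2}
    ranks = [rank[tag] for tag in re.findall(r"\[(HIGH|MED|LOW)\]", output)]
    return ranks == sorted(ranks)


PRODUCT = "CloudSync Pro"
buggy_step1 = "Key Features of SmartFit Pro Fitness Tracker:\n1. Heart rate monitoring..."
buggy_step2 = "1. [LOW] Water resistance\n2. [LOW] GPS mapping\n5. [HIGH] AI-powered insights"
fixed_step2 = "1. [HIGH] End-to-end encryption\n4. [MED] Intelligent search"
buggy_step3 = "Subject: Exciting News!\nHi there,\nWe have something new for you."

print(mentions_product(buggy_step1, PRODUCT))   # wrong-product bug
print(priorities_descending(buggy_step2))       # reversed-order bug
print(priorities_descending(fixed_step2))       # corrected ordering
print(mentions_product(buggy_step3, PRODUCT))   # generic-copy bug
```

Checks like these do not judge quality, but they stop an obviously broken output from being pasted into the next prompt.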

## Part 2: Prompt Testing & A/B Comparison

Professional prompt engineers test prompts systematically:
1. Define test cases with expected outputs
2. Run both prompts against the same inputs
3. Score outputs on consistent dimensions
4. Compare results in a structured table

### Evaluation Dimensions

| Dimension | What to Look For |
|-----------|------------------|
| **Relevance** | Does the output directly address the input? |
| **Completeness** | Does it cover all requested points? |
| **Format Compliance** | Does it follow the requested structure? |
| **Consistency** | Is the quality consistent across different inputs? |
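
One way to keep a comparison structured is to record a score per dimension for each prompt and test input, then total them in a table. The sketch below is an assumption about how that bookkeeping might look: the 1-5 scale, the function names, and the scores are placeholders invented for illustration, not results from this lab.

```python
from typing import Dict

DIMENSIONS = ["Relevance", "Completeness", "Format Compliance", "Consistency"]

Scores = Dict[str, int]  # dimension name -> score (assumed 1-5 scale)


def total(scores: Scores) -> int:
    """Sum a score sheet, refusing sheets that skip a dimension."""
    missing = [d for d in DIMENSIONS if d not in scores]
    if missing:
        raise ValueError(f"missing dimensions: {missing}")
    return sum(scores[d] for d in DIMENSIONS)


def comparison_table(results: Dict[str, Dict[str, Scores]]) -> str:
    """Render a Markdown table: one row per test input, one column per prompt."""
    prompts = sorted({p for by_prompt in results.values() for p in by_prompt})
    lines = ["| Input | " + " | ".join(prompts) + " |",
             "|" + "---|" * (len(prompts) + 1)]
    for test_input, by_prompt in results.items():
        cells = [str(total(by_prompt[p])) for p in prompts]
        lines.append(f"| {test_input} | " + " | ".join(cells) + " |")
    return "\n".join(lines)


# Placeholder scores, invented purely to show the shape of the data.
results = {
    "test input 1": {
        "Prompt A": {"Relevance": 3, "Completeness": 3, "Format Compliance": 2, "Consistency": 3},
        "Prompt B": {"Relevance": 4, "Completeness": 4, "Format Compliance": 5, "Consistency": 4},
    },
}
print(comparison_table(results))
```

Rejecting incomplete score sheets keeps both prompts judged on the same dimensions, which is the point of step 3.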

---

## Exercise 2: Design a Feedback Pipeline (15 minutes)

Process 5 customer feedback items through a multi-step prompt workflow.

### The Feedback Items
1. "The export feature crashes every time I try to save as PDF. This is blocking my entire team's workflow!!!"
2. "Love the new dashboard redesign! The charts are so much clearer and the dark mode option is fantastic."
3. "It would be great if you could add integration with Slack so we get notifications when reports are ready."
4. "Your billing system charged me twice this month. I need an immediate refund. This is unacceptable and I'm considering switching to a competitor."
5. "The search function is slow when filtering by date range. Takes about 10 seconds to load results for large datasets."

---

### Step 1: Categorization Prompt

**Prompt used:**
> Categorize the following customer feedback item into exactly one category: Bug, Feature Request, Praise, or Complaint. Respond with only the category name and a one-sentence justification.
>
> Feedback: "{feedback text}"

**Results:**

| # | Feedback (excerpt) | Category |
|---|--------------------------|----------|
| 1 | The export feature crashes every time... | **Bug** — Reports a crash (broken functionality) |
| 2 | Love the new dashboard redesign... | **Praise** — Positive feedback on existing features |
| 3 | It would be great if you could add... | **Feature Request** — Suggests new Slack integration |
| 4 | Your billing system charged me twice... | **Complaint** — Reports billing error with demand for action |
| 5 | The search function is slow when... | **Bug** — Reports performance issue (slow loading) |
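
When this step is run over many items, it helps to fill the prompt from a template and to reject any response that falls outside the four allowed categories. A minimal sketch, assuming the model's reply starts with the category name (optionally in bold); `build_prompt` and `parse_category` are hypothetical helpers.

```python
ALLOWED_CATEGORIES = ("Bug", "Feature Request", "Praise", "Complaint")

PROMPT_TEMPLATE = (
    "Categorize the following customer feedback item into exactly one category: "
    "Bug, Feature Request, Praise, or Complaint. Respond with only the category "
    "name and a one-sentence justification.\n\n"
    'Feedback: "{feedback}"'
)


def build_prompt(feedback: str) -> str:
    """Insert one feedback item into the categorization prompt."""
    return PROMPT_TEMPLATE.format(feedback=feedback)


def parse_category(response: str) -> str:
    """Return the allowed category the response starts with, or raise."""
    cleaned = response.strip().lstrip("*").strip().lower()
    for category in ALLOWED_CATEGORIES:
        if cleaned.startswith(category.lower()):
            return category
    raise ValueError(f"unexpected category in response: {response!r}")


print(build_prompt("The search function is slow when filtering by date range."))
print(parse_category("**Bug**: reports a performance issue"))
```

Raising on an unexpected category surfaces a bad Step 1 output immediately instead of letting it flow into urgency scoring.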

---

### Step 2: Urgency Scoring Prompt

**Prompt used:**
> Score the urgency of the following customer feedback on a scale of 1-5:
> - 1 = Low (nice to know)
> - 2 = Minor (address when convenient)
> - 3 = Moderate (address this sprint)
> - 4 = High (address this week)
> - 5 = Critical (address immediately — revenue or customer retention at risk)
>
> Consider: Is functionality blocked? Is money involved? Is the customer threatening to leave? How many people are affected?
>
> Feedback: "{feedback text}"

**Results:**

| # | Feedback (excerpt) | Category | Urgency |
|---|--------------------------|----------|--------|
| 1 | The export feature crashes every time... | Bug | **5/5** — Blocking entire team |
| 2 | Love the new dashboard redesign... | Praise | **1/5** — Positive, no action needed |
| 3 | It would be great if you could add... | Feature Request | **2/5** — Nice to have, not blocking |
| 4 | Your billing system charged me twice... | Complaint | **5/5** — Money involved, churn risk |
| 5 | The search function is slow when... | Bug | **3/5** — Degraded but not blocked |

---

### Step 3: Response Planning Prompt

**Prompt used:**
> Based on the following categorized and urgency-scored customer feedback, create an action plan. Prioritize by urgency (highest first). For each item, specify: the responsible team, the action to take, and a target resolution time.
>
> 1. Bug (Urgency 5/5): Export crashes when saving as PDF — blocking team workflow
> 2. Praise (Urgency 1/5): Loves dashboard redesign and dark mode
> 3. Feature Request (Urgency 2/5): Wants Slack integration for report notifications
> 4. Complaint (Urgency 5/5): Double-charged on billing, demands refund, threatening to switch
> 5. Bug (Urgency 3/5): Search slow with date filters on large datasets (10s load time)

**AI-generated action plan:**

> ### Customer Feedback Action Plan (Priority Order)
>
> **CRITICAL — Resolve Within 24 Hours:**
>
> | # | Issue | Team | Action | Target |
> |---|-------|------|--------|--------|
> | 4 | Double billing charge | Billing + Support | Issue immediate refund, send apology email with goodwill credit, investigate root cause | Today |
> | 1 | PDF export crash | Engineering | Hotfix for export module, notify affected users when resolved | 24 hours |
>
> **MODERATE — Resolve This Sprint:**
>
> | # | Issue | Team | Action | Target |
> |---|-------|------|--------|--------|
> | 5 | Slow date-range search | Engineering (Backend) | Profile query performance, add database index for date filters | This sprint |
>
> **LOW — Backlog:**
>
> | # | Issue | Team | Action | Target |
> |---|-------|------|--------|--------|
> | 3 | Slack integration request | Product | Add to feature backlog, survey other users for demand validation | Q2 planning |
> | 2 | Dashboard praise | Marketing + Product | Share with team for morale, use as testimonial (with permission) | Ongoing |

---

### Summary

**What I learned about multi-step workflows:** Breaking the feedback processing into categorize → score → plan made each step clearer and more consistent. If I had tried to do all three in one prompt, the AI would likely have mixed up categories and urgency scores.

**Hardest step:** The urgency scoring prompt required the most iteration. My first version produced inconsistent scores because I didn't define what each number meant. Adding the 1-5 rubric directly in the prompt fixed this.

## Exercise 3: Expose the Rigged A/B Test (15 minutes)

An A/B test below compares two prompts for product descriptions and concludes Prompt B wins dramatically. But the test is rigged.

**Prompt A:** "Describe this product: {product}"

**Prompt B:** (full CRAFT version with context, role, etc.)

**Rigged Results:**

| Product | Prompt A | Prompt B |
|---------|---------|--------|
| NovaBuds Pro earbuds | 6/20 | 19/20 |
| ErgoRise laptop stand | 5/20 | 18/20 |
| HydroTrack water bottle | 7/20 | 20/20 |
| **Total** | **18/60** | **57/60** |

---

### Part 1: Biases Identified

**Bias 1: The expected keywords were copied directly from Prompt B's outputs.** The test cases use words like "immerse," "crystal-clear," "premium," "game-changer" as expected keywords — these are the exact words used in Prompt B's simulated outputs. Prompt A was doomed to score low on completeness because the "correct" answers were written to match Prompt B.

**Bias 2: Prompt A's simulated output is deliberately terrible.** The simulated output for Prompt A is the same generic filler for every product: "This is a good product. It works well and looks nice." A real AI given Prompt A would produce a much more useful description than this. The simulation makes Prompt A look worse than it actually is.

**Bias 3: Prompt B's simulated output is keyword-stuffed to maximize scoring.** Prompt B's outputs were specifically written to include every expected keyword from the test cases. This isn't how a real AI would respond — it's a simulation designed to get the highest possible completeness score.

**Bias 4: The expected format patterns favor Prompt B's structure.** The format regex patterns (like `(hook|bullet|feature).*CTA`) are designed around Prompt B's requested format. Prompt A never asked for hooks and CTAs, so it's penalized for not matching a format it never requested.
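
A small scorer makes Biases 1 and 4 concrete. The sketch assumes completeness is scored as the fraction of expected keywords found in the output, which is one plausible reading of the rigged test, not its confirmed implementation. `output_b` is a hypothetical keyword-stuffed text written for this illustration; `output_a` and the keywords are the ones quoted above.

```python
import re
from typing import List

EXPECTED_KEYWORDS = ["immerse", "crystal-clear", "premium", "game-changer"]
FORMAT_PATTERN = re.compile(r"(hook|bullet|feature).*CTA")


def keyword_completeness(output: str, expected: List[str]) -> float:
    """Fraction of expected keywords that appear in the output."""
    text = output.lower()
    hits = sum(1 for keyword in expected if keyword.lower() in text)
    return hits / len(expected)


def format_compliant(output: str) -> bool:
    """True if the output matches the format pattern used by the test."""
    return FORMAT_PATTERN.search(output) is not None


output_a = "This is a good product. It works well and looks nice."
# Hypothetical Prompt B output, written to contain every expected keyword.
output_b = ("Key feature: immerse yourself in crystal-clear, premium sound. "
            "A game-changer. CTA: Buy now.")

for name, output in [("Prompt A", output_a), ("Prompt B", output_b)]:
    print(name, keyword_completeness(output, EXPECTED_KEYWORDS), format_compliant(output))
```

Because both the keyword list and the pattern were derived from Prompt B's style, Prompt B passes both checks by construction, regardless of which description a customer would actually prefer.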

---

### Part 2: Fair Test Design

**Test products:**
1. Sony WH-1000XM5 headphones
2. Anker 737 portable charger
3. Kindle Paperwhite e-reader

**Fair evaluation criteria:**

| Dimension | What to Look For |
|-----------|------------------|
| Accuracy | Are the product features described correctly? |
| Persuasiveness | Would a customer want to buy after reading this? |
| Completeness | Does it cover key features, benefits, and use cases? |
| Readability | Is it well-organized and easy to scan? |

---

### Part 3: Fair Test Results

**Product 1: Sony WH-1000XM5 headphones**

*Prompt A output:*
> The Sony WH-1000XM5 headphones deliver exceptional noise cancellation and audio quality. With 30-hour battery life, comfortable lightweight design, and multipoint Bluetooth connectivity, they're ideal for commuters, travelers, and work-from-home professionals. The adaptive sound control automatically adjusts to your environment.

*Prompt B output:*
> Looking for headphones that silence the world? The Sony WH-1000XM5 are the gold standard in noise cancellation:
> - Industry-leading ANC with 8 microphones and Auto NC Optimizer
> - 30-hour battery with quick charging (3 min = 3 hours)
> - Speak-to-Chat pauses music when you talk
>
> Whether you're on a cross-country flight or in a noisy open office, these headphones create your personal sound sanctuary. Experience audio the way the artist intended.

| Dimension | Prompt A | Prompt B |
|-----------|---------|--------|
| Accuracy | 4/5 | 5/5 |
| Persuasiveness | 3/5 | 4/5 |
| Completeness | 3/5 | 4/5 |
| Readability | 3/5 | 5/5 |
| **Total** | **13/20** | **18/20** |

**Product 2: Anker 737 — Prompt A: 14/20, Prompt B: 17/20**

**Product 3: Kindle Paperwhite — Prompt A: 13/20, Prompt B: 16/20**

---

### Fair Test Final Results

| Product | Prompt A | Prompt B |
|---------|---------|--------|
| Sony headphones | 13/20 | 18/20 |
| Anker charger | 14/20 | 17/20 |
| Kindle Paperwhite | 13/20 | 16/20 |
| **Total** | **40/60** | **51/60** |

**Does Prompt B still win?** Yes. Prompt B still wins in a fair test, which makes sense because CRAFT prompts genuinely produce better-structured output. But the margin is 40 vs 51 out of 60 (67% vs 85%), not 18 vs 57 (30% vs 95%). The rigged test made the gap look roughly 3.5x larger than it really is (39 points instead of 11).

**What the biases hid:** Prompt A is actually decent — it produces usable descriptions. The rigged test made it look incompetent by using artificially bad simulated outputs and scoring criteria designed to match only Prompt B's style.
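
The totals from the two results tables can be checked with a few lines of arithmetic. The sketch below uses only the totals reported in the rigged and fair tables (each out of 60); the helper name `percent` is introduced here for illustration.

```python
def percent(score: int, max_score: int = 60) -> int:
    """Score as a whole-number percentage of the maximum."""
    return round(100 * score / max_score)


rigged = {"Prompt A": 18, "Prompt B": 57}
fair = {"Prompt A": 40, "Prompt B": 51}

rigged_gap = rigged["Prompt B"] - rigged["Prompt A"]
fair_gap = fair["Prompt B"] - fair["Prompt A"]

print("Rigged:", {name: percent(score) for name, score in rigged.items()}, "gap", rigged_gap)
print("Fair:  ", {name: percent(score) for name, score in fair.items()}, "gap", fair_gap)
print("Gap inflation:", round(rigged_gap / fair_gap, 1))
```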

## Exercise 4: Workflow Architect — Hiring Pipeline (15 minutes)

Design a multi-step prompt workflow to evaluate 4 candidates:

1. **Alex Chen** — 8 years Python/ML, built recommendation systems. Quiet in interviews.
2. **Jordan Rivera** — 3 years, bootcamp grad. Articulate presenter. Built open-source tool.
3. **Sam Patel** — 12 years full-stack, led teams of 10+. Strong opinions, sometimes clashes.
4. **Morgan Kim** — 5 years, PhD in NLP. 4 published papers. Limited industry experience.

---

### Step 1: Scoring Prompt (Fan-Out)

**Prompt used:**
> Score the following job candidate on three dimensions, each 1-5:
> - **Technical Skills** (1=entry level, 5=expert): Consider years of experience, complexity of projects, depth of expertise
> - **Communication** (1=poor, 5=excellent): Consider presentation ability, written communication, interview performance
> - **Culture Fit** (1=poor, 5=excellent): Consider collaboration style, mentorship orientation, team dynamics
>
> Provide a one-sentence justification for each score.
>
> Candidate: "{profile}"

**Results:**

| Candidate | Technical (1-5) | Communication (1-5) | Culture Fit (1-5) | Total (/15) |
|-----------|----------------|--------------------|--------------------|-------------|
| Alex Chen | 5 | 2 | 3 | **10** |
| Jordan Rivera | 2 | 5 | 4 | **11** |
| Sam Patel | 5 | 3 | 2 | **10** |
| Morgan Kim | 4 | 4 | 4 | **12** |

---

### Step 2: Ranking Prompt

**Prompt used:**
> Rank the following candidates from highest to lowest total score. If there's a tie, use Culture Fit as the tiebreaker (higher is better). Explain the ranking.
>
> [scores table from Step 1]

**Ranking output:**
> 1. **Morgan Kim** (12/15) — Most balanced candidate. Strong across all three dimensions with no score below 4.
> 2. **Jordan Rivera** (11/15) — Excellent communicator with strong culture fit, but lower technical score reflects limited experience.
> 3. **Alex Chen** (10/15, Culture Fit: 3) — Top technical skills, but communication is a concern. Wins the Culture Fit tiebreaker over Sam.
> 4. **Sam Patel** (10/15, Culture Fit: 2) — Deep technical expertise, but a tendency to clash with colleagues is a red flag.
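
Because the ranking rule is fully specified (total score, then Culture Fit as the tiebreaker), the model's ranking can be cross-checked deterministically. A minimal sketch using the scores from the Step 1 table; the function names are hypothetical.

```python
from typing import Dict, List

SCORES: Dict[str, Dict[str, int]] = {
    "Alex Chen":     {"Technical": 5, "Communication": 2, "Culture Fit": 3},
    "Jordan Rivera": {"Technical": 2, "Communication": 5, "Culture Fit": 4},
    "Sam Patel":     {"Technical": 5, "Communication": 3, "Culture Fit": 2},
    "Morgan Kim":    {"Technical": 4, "Communication": 4, "Culture Fit": 4},
}


def total(scores: Dict[str, int]) -> int:
    """Sum of the three dimension scores (out of 15)."""
    return sum(scores.values())


def rank(candidates: Dict[str, Dict[str, int]]) -> List[str]:
    """Highest total first; ties broken by higher Culture Fit."""
    return sorted(
        candidates,
        key=lambda name: (-total(candidates[name]), -candidates[name]["Culture Fit"]),
    )


for position, name in enumerate(rank(SCORES), start=1):
    print(position, name, f"{total(SCORES[name])}/15")
```

If the model's ranking ever disagrees with this order, the Step 2 output should be corrected before it is passed to the selection prompt.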

---

### Step 3: Selection Prompt

**Prompt used:**
> Based on the ranking below, select the top 2 finalists. Explain why each was selected and what risks to monitor.
> [ranking from Step 2]

**Selection output:**
> **Finalist 1: Morgan Kim** — Selected for the most balanced profile. Strong technical foundation (PhD + publications) combined with good communication and culture fit. Risk: Limited industry experience may mean slower ramp-up.
>
> **Finalist 2: Jordan Rivera** — Selected for exceptional communication and culture fit, which are harder to teach than technical skills. Risk: Technical gap may require 3-6 months of mentored ramp-up.

---

### Step 4: Interview Questions Prompt

**Prompt used:**
> For each finalist below, generate 2 interview questions that probe their weakest dimension. The questions should give them a chance to demonstrate hidden strength OR confirm the concern.
>
> Finalist 1: Morgan Kim — No single weakest dimension (4/5 across the board); probe Technical Skills because of limited industry experience
> Finalist 2: Jordan Rivera — Weakest dimension: Technical Skills (2/5, 3 years experience)
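
The weakest dimension for each finalist can be looked up from the Step 1 scores rather than filled in by hand. The sketch below returns every dimension tied for the lowest score; `weakest_dimensions` is a hypothetical helper.

```python
from typing import Dict, List

FINALISTS: Dict[str, Dict[str, int]] = {
    "Morgan Kim":    {"Technical": 4, "Communication": 4, "Culture Fit": 4},
    "Jordan Rivera": {"Technical": 2, "Communication": 5, "Culture Fit": 4},
}


def weakest_dimensions(scores: Dict[str, int]) -> List[str]:
    """All dimensions tied for the lowest score, in table order."""
    lowest = min(scores.values())
    return [dimension for dimension, value in scores.items() if value == lowest]


for name, scores in FINALISTS.items():
    print(name, weakest_dimensions(scores))
```

For Morgan Kim all three dimensions tie at 4, so the scores alone do not pick a weakest dimension; choosing Technical Skills relies on the profile note about limited industry experience.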

**Interview questions:**

> **Morgan Kim** (probing industry experience):
> - Q1: "Describe a time you had to adapt your academic research approach to meet a real-world deadline or business constraint. What trade-offs did you make?"
> - Q2: "Walk me through how you would design a production ML pipeline for a recommendation system — not the model, but the infrastructure, monitoring, and deployment."
>
> **Jordan Rivera** (probing technical depth):
> - Q1: "Your open-source tool is impressive. Walk me through the hardest technical decision you made during its development and what you'd change with more experience."
> - Q2: "Given a dataset of 1 million customer interactions, describe how you would build a basic predictive model. What tools would you use and what pitfalls would you watch for?"

---

### Step 5: Recommendation Memo

**Prompt used:**
> Write a hiring recommendation memo based on the following evaluation. Include: recommended hire, backup candidate, reasoning, and onboarding suggestions.
> [all previous outputs]

**Final recommendation memo:**
> ### Hiring Recommendation — Senior AI Engineer Position
>
> **Recommended Hire: Morgan Kim**
> Morgan offers the strongest overall profile with a 12/15 score and no dimension below 4. Their PhD in NLP and 4 published papers provide a deep theoretical foundation that will be immediately relevant to our AI roadmap. While their industry experience is limited, their strong communication skills and collaborative style suggest they will ramp up quickly with structured onboarding.
>
> **Backup: Jordan Rivera**
> Jordan's exceptional communication (5/5) and culture fit (4/5) make them an excellent long-term investment, particularly if we can provide mentorship. Their technical gap (2/5) is the most significant concern, but their open-source tool demonstrates self-directed learning ability.
>
> **Onboarding Recommendations:**
> - Pair Morgan with a senior engineer for the first 60 days to bridge the academic-to-industry gap
> - If hiring Jordan, establish a 90-day technical development plan with weekly check-ins
> - Both candidates would benefit from a "buddy system" with current team members

## Key Takeaways

### Prompt Engineering Best Practices Checklist

- Use the CRAFT framework for every important prompt
- Break complex tasks into chains — don't ask the AI to do everything at once
- Inspect intermediate outputs before they cascade
- Test prompts systematically with multiple inputs
- Watch for A/B test biases
- Adapt prompts for your audience
- Build a template library
- Score prompts with a rubric — but remember rubrics have blind spots

### When to Use Each Pattern

| Pattern | Best For | Example |
|---------|---------|--------|
| **Sequential** | Multi-step processes | Research → Outline → Draft → Edit |
| **Fan-Out** | Same task, many items | Scoring candidates, analyzing documents |
| **Iterative** | Progressive refinement | Draft → Improve → Polish |
| **Combined** | Complex real-world tasks | Fan-out scoring + sequential ranking |

---

## Final Reflection — Example Answers

**1. Most surprising thing about prompt chains?**
The biggest surprise is how much cascade failures matter. A small error in Step 1 (wrong product features) completely ruined the final email — it wasn't just slightly wrong, it was about the wrong product entirely. This shows why inspecting intermediate outputs is critical.

**2. A daily work task that could benefit from a multi-step workflow:**
- Step 1: Extract key points from meeting transcript (focus/narrow)
- Step 2: Categorize points into Decisions, Action Items, and Open Questions (structure)
- Step 3: Format as a team email with owners and deadlines (audience-tailor)
- Step 4: Generate follow-up questions for unresolved items (extend)

**3. One A/B test bias to watch for:**
Using expected keywords that are copied from one prompt's output. This guarantees that prompt will score higher on "completeness" regardless of actual quality. Fair tests need neutral evaluation criteria that don't favor either prompt's style.

**4. One piece of prompt engineering advice:**
Break complex requests into chains of 2-4 focused prompts. A single prompt trying to do everything tends to produce mediocre results across the board. Three focused prompts each doing one thing well will usually beat one prompt trying to do three things at once.

---

*Solution key complete. Remember: these are example answers — many valid approaches exist for each exercise.*

---