# Pattern 23: Multi-Agent Collaboration

## Learning Objectives

By completing this tutorial, you will:
- Understand multi-agent collaboration patterns (parallel vs. sequential)
- Learn how ReviewerPanel orchestrates 6 reviewers (3 specialist, 3 adversarial)
- Master parallel execution with `asyncio.gather()` for up to ~6x speedup
- Practice designing specialist vs. adversarial reviewer personas
- Implement Secretary consolidation of diverse feedback
- Profile performance gains from parallelization

## Prerequisites

- **Python**: Intermediate proficiency with async/await
- **LLM APIs**: Gemini API key (set in `.env` file)
- **Composable App**: Familiarity with AbstractWriter and Article dataclass
- **Recommended**: Complete [Reflection Pattern Tutorial](../concepts/reflection_pattern.md) first

## Estimated Time

- **Reading**: 20-25 minutes
- **Hands-on exercises**: 15-20 minutes
- **Total**: 35-45 minutes

## Cost Warning

⚠️ **API Cost Estimate**: about $0.002 per run (see the calculation below)

This notebook makes the following LLM API calls:
- **6 parallel reviewer calls** (~1,500 tokens input × 6 = 9,000 tokens)
- **1 secretary consolidation** (~3,000 tokens input)
- **1 writer revision** (~2,000 tokens input)

**Total**: ~14,000 input tokens + ~2,000 output tokens

**Gemini 2.0 Flash pricing** (as of 2025):
- Input: $0.075 per 1M tokens
- Output: $0.30 per 1M tokens

**Estimated cost per run**: (14,000 × $0.075 / 1M) + (2,000 × $0.30 / 1M) ≈ **$0.00165**

If you run all exercises 10 times: **~$0.02**
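
The arithmetic above can be reproduced with a small helper. This is a sketch that assumes the per-million-token prices listed in this section (update the defaults if pricing changes); `estimate_cost` is a hypothetical helper, not part of composable_app.

```python
def estimate_cost(input_tokens: int, output_tokens: int,
                  input_price_per_m: float = 0.075,
                  output_price_per_m: float = 0.30) -> float:
    """Return the USD cost of one run, given prices per 1M tokens."""
    return (input_tokens * input_price_per_m
            + output_tokens * output_price_per_m) / 1_000_000

# Token budget for one run of this notebook (from the breakdown above)
reviewer_input = 6 * 1_500   # 6 parallel reviewers
secretary_input = 3_000      # 1 consolidation call
writer_input = 2_000         # 1 revision call
total_input = reviewer_input + secretary_input + writer_input  # 14,000

per_run = estimate_cost(total_input, 2_000)
print(f"Per run: ${per_run:.5f}")       # Per run: $0.00165
print(f"10 runs: ${10 * per_run:.4f}")  # 10 runs: $0.0165
```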

**Cost savings tip**: Use pre-executed outputs provided in this notebook to learn without API calls.

## Book Reference

> **Pattern 23: Multi-Agent Collaboration** is detailed in *Generative AI Design Patterns* (Lakshmanan & Hapke, O'Reilly 2025), Chapter on "Multi-Agent Systems and Orchestration" (pages TBD)
>
> Related patterns:
> - **Pattern 18 (Reflection)**: How agents incorporate feedback
> - **Pattern 19 (Dependency Injection)**: ReviewerPanel uses DI for agent composition
> - **Pattern 25 (Prompt Caching)**: Reusing system prompts across reviewers

---

## Setup and Imports

First, let's set up our environment and import the necessary modules.

In [None]:
# Standard library imports
import asyncio
import time
import os
import sys
from pathlib import Path
from dataclasses import replace

# Add composable_app to path
composable_app_path = Path.cwd().parent.parent
if str(composable_app_path) not in sys.path:
    sys.path.insert(0, str(composable_app_path))

# Import composable app modules
from composable_app.agents.article import Article
from composable_app.agents.generic_writer_agent import (
    Writer,
    WriterFactory,
    AbstractWriter,
    GenAIWriter
)
from composable_app.agents.reviewer_panel import ReviewerPanel
from composable_app.utils.prompt_service import PromptService

print("✅ Imports successful")
print(f"📁 Working directory: {Path.cwd()}")
print(f"📦 Composable app path: {composable_app_path}")

In [None]:
# Check for API key
from dotenv import load_dotenv

# Load environment variables from .env file
env_path = composable_app_path / ".env"
if env_path.exists():
    load_dotenv(env_path)
    print(f"✅ Loaded .env from {env_path}")
else:
    print(f"⚠️  No .env file found at {env_path}")
    print("   Create one with: GEMINI_API_KEY=your_key_here")

# Validate API key
api_key = os.getenv("GEMINI_API_KEY")
if not api_key:
    print("❌ GEMINI_API_KEY not found in environment")
    print("   Set it in .env file or export GEMINI_API_KEY='your_key'")
    raise ValueError("Missing GEMINI_API_KEY")
elif api_key.startswith("your_") or len(api_key) < 10:
    print("❌ GEMINI_API_KEY looks invalid (placeholder or too short)")
    print(f"   Current value: {api_key[:20]}...")
    raise ValueError("Invalid GEMINI_API_KEY")
else:
    print(f"✅ GEMINI_API_KEY found (length: {len(api_key)})")
    print(f"   Key preview: {api_key[:10]}...{api_key[-5:]}")

In [None]:
# Test import of ReviewerPanel components
try:
    from composable_app.agents.reviewer_panel import (
        ReviewerPanel,
        GrammarReviewer,
        MathReviewer,
        DistrictRepReviewer,
        ConservativeParentReviewer,
        LiberalParentReviewer,
        SchoolAdminReviewer,
        SecretaryReviewer
    )
    print("✅ All reviewer classes imported successfully")
    print("   - GrammarReviewer")
    print("   - MathReviewer")
    print("   - DistrictRepReviewer")
    print("   - ConservativeParentReviewer")
    print("   - LiberalParentReviewer")
    print("   - SchoolAdminReviewer")
    print("   - SecretaryReviewer")
except ImportError as e:
    print(f"❌ Failed to import reviewer classes: {e}")
    print("   Make sure composable_app/agents/reviewer_panel.py exists")
    raise

### ✅ Setup Complete!

If all cells above executed without errors, you're ready to proceed.

**What we've verified**:
- ✅ Python environment with async support
- ✅ Composable app modules accessible
- ✅ Gemini API key configured
- ✅ All reviewer classes available

**Cost reminder**: At the Gemini 2.0 Flash prices listed above, each run of the exercises below costs roughly **$0.002** in API calls (about $0.02 for ten runs).

---

## Part 1: Understanding Multi-Agent Collaboration

### What is Multi-Agent Collaboration?

**Multi-agent collaboration** is a design pattern where multiple AI agents work together to solve a problem that would be difficult for a single agent. Each agent specializes in a specific aspect of the task.

**Key characteristics**:
- 🎯 **Specialization**: Each agent has a specific role (e.g., grammar review, math accuracy)
- 🔄 **Coordination**: Agents communicate and combine their outputs
- ⚡ **Parallelization**: Independent agents can run simultaneously
- 🎭 **Diversity**: Different perspectives lead to better outcomes

### Real-World Analogy

Think of a **manuscript review process**:

```
Sequential Review (one reviewer at a time)
├─ Editor reads manuscript (30 min)
├─ Technical reviewer checks facts (30 min)
├─ Style reviewer checks writing (30 min)
└─ Legal reviewer checks compliance (30 min)
Total: 2 hours (sequential)

Multi-Expert Panel (Parallel)
├─ Editor reads manuscript ─────────┐
├─ Technical reviewer checks facts ─┤
├─ Style reviewer checks writing ───┤── All happen simultaneously
└─ Legal reviewer checks compliance ┘
Total: 30 minutes (parallel) ✅ 4x faster!
```

### Multi-Agent Patterns in LLM Applications

There are two main orchestration patterns:

#### 1. Sequential Pattern (Chain of Agents)

```
Agent 1 → Output 1 → Agent 2 → Output 2 → Agent 3 → Final Output
```

**When to use**:
- ✅ Each agent depends on previous agent's output
- ✅ Clear workflow order (e.g., Draft → Review → Revise)
- ✅ Example: Writing pipeline (Researcher → Writer → Editor)

**Trade-offs**:
- ❌ Slower (total time = sum of all agents)
- ✅ Simpler to debug (linear flow)
- ❌ Bottleneck if one agent is slow
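
A minimal sketch of the sequential pattern, using hypothetical stand-in agents (`researcher`, `writer`, `editor`) that only sleep to simulate LLM latency. Each step awaits the previous one, so total time is roughly the sum of the steps.

```python
import asyncio
import time

# Hypothetical stand-in agents: each simulates an LLM call with asyncio.sleep.
async def researcher(topic: str) -> str:
    await asyncio.sleep(0.1)
    return f"notes on {topic}"

async def writer(notes: str) -> str:
    await asyncio.sleep(0.1)
    return f"draft based on {notes}"

async def editor(draft: str) -> str:
    await asyncio.sleep(0.1)
    return f"edited {draft}"

async def run_chain(topic: str) -> tuple[str, float]:
    """Run the three agents in order; each one needs the previous output."""
    start = time.perf_counter()
    notes = await researcher(topic)
    draft = await writer(notes)
    final = await editor(draft)
    return final, time.perf_counter() - start
```

In a notebook cell, call `final, elapsed = await run_chain("photosynthesis")`; in a plain script, use `asyncio.run(run_chain("photosynthesis"))`. Expect `elapsed` of about 0.3 s, the sum of the three delays.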

#### 2. Parallel Pattern (Panel of Agents)

```
            ┌─ Agent 1 → Output 1 ─┐
Input ──────┼─ Agent 2 → Output 2 ─┼─→ Aggregator → Final Output
            └─ Agent 3 → Output 3 ─┘
```

**When to use**:
- ✅ Agents are independent (don't need each other's outputs)
- ✅ Want diverse perspectives on same input
- ✅ Example: Review panel (6 reviewers evaluate same article)

**Trade-offs**:
- ✅ Faster (total time ≈ slowest agent + aggregation)
- ✅ More robust (with `return_exceptions=True`, one agent's failure doesn't stop the others)
- ❌ More complex (need aggregation logic)
- ❌ Bursts of simultaneous calls can hit API rate limits (total cost matches running the same agents sequentially)
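
The panel pattern can be sketched as a fan-out with `asyncio.gather()` followed by a fan-in step. The `review` coroutine and `aggregate` function below are hypothetical stand-ins, not the ReviewerPanel API; each reviewer just sleeps for 0.1 s to simulate an LLM call.

```python
import asyncio
import time

async def review(name: str, article: str, delay: float = 0.1) -> str:
    """Hypothetical reviewer: simulates an LLM call, then returns feedback."""
    await asyncio.sleep(delay)
    return f"{name}: feedback on '{article}'"

def aggregate(feedback: list[str]) -> str:
    """Fan-in step: combine the independent outputs into one result."""
    return "\n".join(sorted(feedback))

async def run_panel(article: str, names: list[str]) -> tuple[str, float]:
    start = time.perf_counter()
    # Fan-out: all reviews are started together and awaited as a group.
    # gather() returns results in the same order as the awaitables passed in.
    feedback = await asyncio.gather(*(review(n, article) for n in names))
    return aggregate(list(feedback)), time.perf_counter() - start
```

With three reviewers, `await run_panel("draft", ["Agent 1", "Agent 2", "Agent 3"])` should take about 0.1 s rather than 0.3 s, because total time tracks the slowest reviewer instead of the sum.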

### ComposableApp's Multi-Agent Architecture

The **ReviewerPanel** is structured as a panel of independent agents (the **parallel pattern**):

```
                        ┌─ GrammarReviewer ─────────┐
                        ├─ MathReviewer ────────────┤
Article (draft) ────────┼─ DistrictRepReviewer ─────┤─→ SecretaryReviewer → Consolidated Feedback
                        ├─ ConservativeParent ──────┤
                        ├─ LiberalParent ───────────┤
                        └─ SchoolAdminReviewer ─────┘

                        All 6 are independent, so they can run in parallel (asyncio.gather)
```

**Why this design?**:
1. **Independence**: Each reviewer evaluates the article independently
2. **Diversity**: 6 different perspectives (3 specialist, 3 adversarial)
3. **Speed**: Because the reviews are independent, running them concurrently can be up to ~6x faster than awaiting them one at a time
4. **Robustness**: With `return_exceptions=True`, one reviewer's failure doesn't discard the other reviews

---

### Specialist vs. Adversarial Reviewers

The ReviewerPanel uses two types of agents:

#### Specialist Reviewers (Domain Experts)

**Purpose**: Ensure technical accuracy and quality

1. **GrammarReviewer**
   - Checks spelling, grammar, punctuation
   - Ensures grade-appropriate reading level
   - Validates sentence structure

2. **MathReviewer**
   - Verifies mathematical accuracy
   - Checks formulas and calculations
   - Ensures proper notation

3. **DistrictRepReviewer**
   - Validates alignment with curriculum standards
   - Checks learning objectives coverage
   - Ensures educational value

**Example feedback**:
```
Grammar: "Paragraph 2, sentence 3: 'photosynthesis occur' should be 'photosynthesis occurs'"
Math: "Equation on line 5 is incorrect: should be E=mc² not E=mc"
District: "Missing connection to 9th grade biology standard BS.9.2.1"
```

#### Adversarial Reviewers (Stakeholder Perspectives)

**Purpose**: Identify potential controversies or concerns

4. **ConservativeParentReviewer**
   - Flags content that conservative parents might object to
   - Checks for political bias (left-leaning)
   - Ensures traditional values alignment

5. **LiberalParentReviewer**
   - Flags content that liberal parents might object to
   - Checks for political bias (right-leaning)
   - Ensures inclusive language

6. **SchoolAdminReviewer**
   - Identifies potential legal/liability issues
   - Checks compliance with school policies
   - Flags controversial topics

**Example feedback**:
```
Conservative: "Evolution section may be controversial in some districts. Consider adding 'theory' qualifier."
Liberal: "Gender roles example in paragraph 4 may be seen as stereotypical. Use more inclusive examples."
School Admin: "Chemistry experiment on page 3 requires safety warning and parental consent."
```
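
One way to keep the two reviewer families explicit is to describe each persona as data. This is a hypothetical sketch: the `ReviewerPersona` dataclass, the `Kind` enum, and the mandate strings (paraphrased from the lists above) are illustrative and are not the classes defined in `reviewer_panel.py`.

```python
from dataclasses import dataclass
from enum import Enum

class Kind(Enum):
    SPECIALIST = "specialist"    # objective, domain-expert checks
    ADVERSARIAL = "adversarial"  # stakeholder perspectives

@dataclass(frozen=True)
class ReviewerPersona:
    name: str
    kind: Kind
    mandate: str  # becomes the core of the reviewer's system prompt

PANEL = [
    ReviewerPersona("GrammarReviewer", Kind.SPECIALIST,
                    "Check spelling, grammar, punctuation and reading level."),
    ReviewerPersona("MathReviewer", Kind.SPECIALIST,
                    "Verify equations, calculations, and notation."),
    ReviewerPersona("DistrictRepReviewer", Kind.SPECIALIST,
                    "Check alignment with curriculum standards and objectives."),
    ReviewerPersona("ConservativeParentReviewer", Kind.ADVERSARIAL,
                    "Flag content conservative parents might object to."),
    ReviewerPersona("LiberalParentReviewer", Kind.ADVERSARIAL,
                    "Flag content liberal parents might object to."),
    ReviewerPersona("SchoolAdminReviewer", Kind.ADVERSARIAL,
                    "Flag legal, liability, and school-policy concerns."),
]

def system_prompt(p: ReviewerPersona) -> str:
    """Build a (hypothetical) system prompt from a persona definition."""
    return (f"You are {p.name} ({p.kind.value} reviewer). {p.mandate} "
            "Cite specific locations in the article.")

specialists = [p.name for p in PANEL if p.kind is Kind.SPECIALIST]
adversarial = [p.name for p in PANEL if p.kind is Kind.ADVERSARIAL]
```

Keeping personas as data makes it easy to add, remove, or swap a reviewer without touching the orchestration code.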

---

### Why Use Adversarial Reviewers?

**Traditional approach**: Only use expert reviewers

```
Article → Grammar + Math + Curriculum Review → Publish
↓
Controversy erupts after publication ❌
```

**Multi-agent approach**: Include adversarial perspectives

```
Article → Specialist Reviews + Adversarial Reviews → Identify issues early → Fix before publish
↓
Minimal controversy, broader acceptance ✅
```

**Benefits**:
- 🛡️ **Risk mitigation**: Catch controversial content before publication
- 🎯 **Broader appeal**: Address concerns from diverse stakeholders
- 📊 **Better decisions**: Surface trade-offs between competing values
- 💰 **Cost savings**: Fixing issues pre-publication is cheaper than post-publication crisis management

**Illustrative scenario**:
Suppose a science article includes an experiment that is educationally sound but requires chemicals unavailable to low-income schools. A reviewer focused on school policy, such as the **SchoolAdminReviewer**, could flag this equity issue so the writer substitutes an experiment that all schools can perform.

---

### Performance: Sequential vs. Parallel

Let's analyze the performance difference:

**Sequential execution** (one reviewer at a time):
```python
total_time = 0
for reviewer in reviewers:
    start = time.time()
    feedback = await reviewer.review(article)
    elapsed = time.time() - start
    total_time += elapsed
    
# If each reviewer takes 5 seconds:
# Total = 6 × 5s = 30 seconds
```

**Parallel execution** (all reviewers simultaneously):
```python
start = time.time()
feedbacks = await asyncio.gather(
    reviewer1.review(article),
    reviewer2.review(article),
    reviewer3.review(article),
    reviewer4.review(article),
    reviewer5.review(article),
    reviewer6.review(article),
)
total_time = time.time() - start

# Total ≈ the slowest single review = 5 seconds
# Speedup: 30s / 5s = 6x faster! 🚀
```

**When parallelization helps**:
- ✅ **I/O-bound operations** (LLM API calls, database queries)
- ✅ **Independent tasks** (reviewers don't need each other's output)
- ✅ **Multiple agents** (more agents = more speedup)

**When it doesn't help**:
- ❌ **CPU-bound operations** (local computation already maxes CPU)
- ❌ **Dependent tasks** (agent 2 needs agent 1's output)
- ❌ **Rate limits** (API restricts concurrent requests)
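
The blocking-work caveat is easy to see with two hypothetical reviewers: one awaits `asyncio.sleep` (standing in for a network call) and one calls the blocking `time.sleep` (standing in for local computation or a synchronous SDK call). `asyncio.gather()` overlaps only the first kind. For a blocking synchronous client, `asyncio.to_thread()` (Python 3.9+) can move each call onto a worker thread, although CPU-heavy pure-Python work still contends for the GIL.

```python
import asyncio
import time

async def io_bound_review() -> str:
    await asyncio.sleep(0.1)  # yields to the event loop while "waiting on the API"
    return "ok"

async def blocking_review() -> str:
    time.sleep(0.1)           # blocks the whole event loop; nothing else can run
    return "ok"

async def timed_gather(make_review, n: int = 6) -> float:
    """Run n copies of a review coroutine with gather() and time the batch."""
    start = time.perf_counter()
    await asyncio.gather(*(make_review() for _ in range(n)))
    return time.perf_counter() - start

async def compare() -> dict[str, float]:
    return {
        "io_bound": await timed_gather(io_bound_review),   # overlapped
        "blocking": await timed_gather(blocking_review),   # serialized
    }
```

Running `await compare()` in a notebook (or `asyncio.run(compare())` in a script) should report roughly 0.1 s for the I/O-bound reviewers versus roughly 0.6 s for the blocking ones.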

---

### Secretary Pattern: Consolidating Feedback

After 6 reviewers provide feedback, we need to **consolidate** it into actionable guidance for the writer.

**Problem**: Without consolidation, the writer receives:
```
Grammar: "Fix 12 spelling errors"
Math: "Equation 3 is wrong"
District: "Add learning objective"
Conservative: "Remove controversial example"
Liberal: "Add inclusive language"
Admin: "Add safety warning"

Writer: "I have 6 different priorities. Which should I focus on first?" 😕
```

**Solution**: **SecretaryReviewer** consolidates feedback

**Secretary's role**:
1. **Synthesize**: Combine related feedback (e.g., grammar + style)
2. **Prioritize**: Rank feedback by importance (critical errors first)
3. **Resolve conflicts**: When reviewers disagree, provide balanced guidance
4. **Simplify**: Convert technical feedback into clear action items
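
A sketch of how these four responsibilities might be turned into a consolidation prompt. `build_secretary_prompt` is a hypothetical helper and the wording is illustrative; it is not the prompt shipped with composable_app.

```python
def build_secretary_prompt(article: str, reviews: dict[str, str]) -> str:
    """Assemble one prompt asking an LLM to consolidate several reviews."""
    review_text = "\n".join(f"- {name}: {text}" for name, text in reviews.items())
    return (
        "You are the panel secretary. Consolidate the reviews below into one plan.\n"
        "1. Synthesize: merge related comments.\n"
        "2. Prioritize: group items as CRITICAL, HIGH PRIORITY, or OPTIONAL.\n"
        "3. Resolve conflicts: when reviewers disagree, propose a balanced revision.\n"
        "4. Simplify: phrase every item as a concrete action, citing its source reviewer.\n\n"
        f"ARTICLE:\n{article}\n\nREVIEWS:\n{review_text}\n"
    )

prompt = build_secretary_prompt(
    "Photosynthesis occur in chloroplasts...",
    {"GrammarReviewer": "'occur' should be 'occurs'",
     "MathReviewer": "No equations to check"},
)
```

Naming each reviewer next to its comment lets the secretary attribute every action item to its source, as in the example output that follows.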

**Example consolidated feedback**:
```
CRITICAL (Fix immediately):
- Equation in paragraph 3 is mathematically incorrect (MathReviewer)
- Missing safety warning for chemistry experiment (SchoolAdminReviewer)

HIGH PRIORITY (Fix before review):
- 12 spelling/grammar errors throughout (GrammarReviewer)
- Learning objective not explicitly stated (DistrictRepReviewer)

OPTIONAL (Consider for revision):
- Evolution section may need "theory" qualifier for broader acceptance (ConservativeParent)
- Gender example could be more inclusive (LiberalParent)
```
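
The consolidation itself is an LLM call (the secretary step in the cost breakdown), but the output shape above can be mirrored by a deterministic stand-in, which is handy for testing downstream code. The `Item` type and the severity labels are assumptions for illustration.

```python
from dataclasses import dataclass

ORDER = ["CRITICAL", "HIGH PRIORITY", "OPTIONAL"]  # most urgent first

@dataclass
class Item:
    severity: str  # one of ORDER
    text: str
    source: str    # reviewer that raised the item

def format_consolidated(items: list[Item]) -> str:
    """Group items by severity and render them as a prioritized to-do list."""
    lines = []
    for level in ORDER:
        group = [i for i in items if i.severity == level]
        if group:
            lines.append(f"{level}:")
            lines.extend(f"- {i.text} ({i.source})" for i in group)
    return "\n".join(lines)

report = format_consolidated([
    Item("OPTIONAL", "Gender example could be more inclusive", "LiberalParent"),
    Item("CRITICAL", "Equation in paragraph 3 is incorrect", "MathReviewer"),
    Item("HIGH PRIORITY", "Learning objective not stated", "DistrictRepReviewer"),
])
```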

**Conceptually** (an illustrative call; the actual method names in `reviewer_panel.py` may differ):
```python
secretary = SecretaryReviewer()
consolidated = await secretary.consolidate(
    article=article,
    reviews=[grammar_feedback, math_feedback, ...]
)
```

---

### Key Takeaways

- ✅ **Multi-agent collaboration** = multiple specialized agents working together
- ✅ **Parallel pattern** = agents run simultaneously (faster, more robust)
- ✅ **Sequential pattern** = agents run one after another (simpler, dependent tasks)
- ✅ **Specialist reviewers** = domain experts (grammar, math, curriculum)
- ✅ **Adversarial reviewers** = stakeholder perspectives (parents, admin)
- ✅ **Secretary pattern** = consolidates diverse feedback into actionable guidance
- ✅ **Performance**: Parallel = up to ~6x faster for 6 independent agents

**Next**: We'll see the actual ReviewerPanel implementation in code!

---

In [None]:
# Performance Comparison Summary
import pandas as pd

# Collect all metrics
all_metrics = [metrics_sequential, metrics_parallel, metrics_batched]

# Create comparison table
comparison_data = []
for m in all_metrics:
    comparison_data.append({
        "Strategy": m.strategy,
        "Total Time (s)": f"{m.total_time:.2f}",
        "Reviews": m.num_reviews,
        "Speedup": f"{m.speedup:.2f}x",
        "Throughput (rev/s)": f"{m.throughput:.2f}"
    })

df = pd.DataFrame(comparison_data)

print("PERFORMANCE COMPARISON SUMMARY")
print("=" * 70)
print(df.to_string(index=False))

print("\n" + "=" * 70)
print("\n📊 Key Insights:")
print(f"  1. Parallel execution is {metrics_parallel.speedup:.1f}x faster than sequential")
print(f"  2. Batched execution balances speed ({metrics_batched.speedup:.1f}x) with rate limit safety")
print(f"  3. All strategies complete the same {metrics_sequential.num_reviews} reviews")
print(f"  4. Cost is IDENTICAL for all strategies (same number of API calls)")

print("\n💡 When to use each strategy:")
print("  • Sequential: Debugging, strict rate limits, low concurrency environment")
print("  • Parallel: Production with generous rate limits, minimize latency")
print("  • Batched: Production with strict rate limits, balance speed and safety")

print("\n⚠️  Rate Limit Considerations:")
print("  • Gemini API: 15 requests/minute (free tier), 60 requests/minute (paid tier)")
print("  • 6 parallel reviewers = instant burst, then wait")
print("  • Batched (2 at a time) = smoother rate limit compliance")

print("\n🎯 ComposableApp's Choice:")
print("  Current implementation uses SEQUENTIAL execution")
print("  Reason: Simplifies debugging, predictable token usage, avoids rate limits")
print("  Trade-off: Slower user experience (30s vs 5s for 6 reviewers)")

print("\n✅ Exercise: Modify `do_first_round_reviews()` to use `asyncio.gather()`")
print("  See: composable_app/agents/reviewer_panel.py:100-131")
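
The summary cell reads `strategy`, `total_time`, `num_reviews`, `speedup`, and `throughput` from metrics objects created earlier in the notebook. Their definition isn't shown in this section; a minimal version, assuming speedup is measured against the sequential run, might look like the sketch below (`RunMetrics` and `baseline_time` are hypothetical names).

```python
from dataclasses import dataclass

@dataclass
class RunMetrics:
    strategy: str
    total_time: float     # wall-clock seconds for all reviews
    num_reviews: int
    baseline_time: float  # sequential wall-clock time, for comparison

    @property
    def speedup(self) -> float:
        return self.baseline_time / self.total_time

    @property
    def throughput(self) -> float:
        return self.num_reviews / self.total_time  # reviews per second

# Illustrative numbers using this tutorial's 5 s-per-reviewer assumption
seq = RunMetrics("Sequential", 30.0, 6, baseline_time=30.0)
par = RunMetrics("Parallel", 5.0, 6, baseline_time=30.0)
```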

## Part 8: Self-Assessment and Next Steps

Test your understanding of multi-agent collaboration patterns with these questions.

---

### Self-Assessment Questions

<details>
<summary><strong>Question 1: When should you use parallel vs. sequential agent execution?</strong></summary>

**Answer**:

**Use Parallel Execution when**:
- ✅ Tasks are **independent** (agents don't need each other's outputs)
- ✅ Operations are **I/O-bound** (LLM API calls, database queries)
- ✅ You want to **minimize latency** (user waiting time)
- ✅ API rate limits allow concurrent requests

**Example**: Round 1 reviews where each reviewer evaluates independently

**Use Sequential Execution when**:
- ✅ Tasks are **dependent** (agent N needs output from agent N-1)
- ✅ You need **predictable order** for debugging
- ✅ **Strict rate limits** don't allow concurrent requests
- ✅ Want to **minimize complexity** in development

**Example**: Round 2 reviews where reviewers see Round 1 feedback

**Performance trade-off**:
- Parallel: 6 agents = max(5s) = 5 seconds
- Sequential: 6 agents = 6 × 5s = 30 seconds

**Cost**: Same (6 API calls either way)
</details>

<details>
<summary><strong>Question 2: What does `return_exceptions=True` do in `asyncio.gather()`, and why is it important?</strong></summary>

**Answer**:

**Without `return_exceptions=True` (default)**:
```python
results = await asyncio.gather(task1, task2, task3)
# If task2 fails, gather() immediately raises task2's exception
# task1 and task3 are NOT cancelled, but you don't get their results from gather()
```

**With `return_exceptions=True`**:
```python
results = await asyncio.gather(task1, task2, task3, return_exceptions=True)
# If task2 fails, gather() returns [result1, Exception, result3]
# You get partial results and can handle failure gracefully
```

**Why it's important for multi-agent systems**:
- 🛡️ **Robustness**: One reviewer failure doesn't stop others
- 📊 **Partial results**: Get 5 out of 6 reviews instead of 0
- 🐛 **Better debugging**: Can inspect which specific agent failed and why
- 🎯 **Production-ready**: Graceful degradation instead of complete failure

**Best practice**:
```python
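import logging

logger = logging.getLogger(__name__)

results = await asyncio.gather(*tasks, return_exceptions=True)
# BaseException also catches asyncio.CancelledError (not an Exception subclass since 3.8)
successful = [r for r in results if not isinstance(r, BaseException)]
failed = [r for r in results if isinstance(r, BaseException)]

for exc in failed:
    logger.error("Agent failed: %r", exc)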
```
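
A small runnable check of both behaviours, using hypothetical reviewers that sleep and one that raises. With the default setting the exception propagates immediately, but the other awaitables are not cancelled: they keep running, and you simply don't receive their results from that `gather()` call.

```python
import asyncio

async def ok_reviewer(name: str, delay: float = 0.05) -> str:
    await asyncio.sleep(delay)
    return f"{name}: ok"

async def failing_reviewer() -> str:
    await asyncio.sleep(0.01)
    raise TimeoutError("simulated API timeout")

async def demo() -> tuple[str, list]:
    # Default: the first exception propagates out of gather().
    try:
        await asyncio.gather(ok_reviewer("Grammar"), failing_reviewer())
        default_outcome = "no error"
    except TimeoutError as exc:
        default_outcome = f"raised: {exc}"

    # return_exceptions=True: exceptions come back in the results list, in order.
    results = await asyncio.gather(
        ok_reviewer("Grammar"), failing_reviewer(), ok_reviewer("Math"),
        return_exceptions=True,
    )
    return default_outcome, results
```

Run it with `await demo()` in a notebook. The first value should be `"raised: simulated API timeout"`, and the list should hold two successful reviews around a `TimeoutError` instance.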
</details>

<details>
<summary><strong>Question 3: What are the key differences between specialist and adversarial reviewers?</strong></summary>

**Answer**:

| Aspect | Specialist Reviewers | Adversarial Reviewers |
|--------|---------------------|----------------------|
| **Goal** | Technical accuracy and quality | Identify controversies and concerns |
| **Criteria** | Objective, measurable | Subjective, value-based |
| **Examples** | Grammar, Math, District Rep | Conservative/Liberal Parents, Admin |
| **Feedback type** | "Spelling error on line 5" | "This topic may upset some parents" |
| **Conflicts** | Rarely conflict with each other | Often conflict with each other |
| **Purpose** | Ensure correctness | Surface stakeholder concerns early |

**Why include both**:
- ⚠️ **Specialist alone**: Content is accurate but may cause controversy
- ⚠️ **Adversarial alone**: Content is safe but may have factual errors
- ✅ **Both together**: Content is accurate AND broadly acceptable

**Real-world value**:
Adversarial reviewers catch issues that would cause problems after publication, when fixing is much more expensive.
</details>

<details>
<summary><strong>Question 4: What are the three main responsibilities of the Secretary consolidation pattern?</strong></summary>

**Answer**:

The Secretary (meta-reviewer) has three main responsibilities:

**1. Synthesize and Prioritize**
```
Input (raw reviews):
- GrammarReviewer: "12 spelling errors"
- MathReviewer: "Equation on line 15 is wrong"
- DistrictRep: "Article is too long"

Output (prioritized):
CRITICAL: Fix equation (affects learning)
HIGH: Fix 12 spelling errors (professionalism)
MODERATE: Reduce length (budget concern)
```

**2. Resolve Conflicts**
```
Input (conflicting reviews):
- ConservativeParent: "Remove paragraph 3"
- LiberalParent: "Expand paragraph 3"

Output (balanced resolution):
REQUIRES DISCUSSION: Keep paragraph 3 but revise for balance
- Use neutral language (addresses conservative concern)
- Maintain context (addresses liberal concern)
```

**3. Simplify into Action Items**
```
Input (technical reviews):
- GrammarReviewer: "Passive voice in sentences 2, 5, 8..."
- MathReviewer: "Use × not * for multiplication"

Output (actionable):
TO-DO:
1. Change line 15: "5 * 3 = 15" → "5 × 3 = 15"
2. Convert passive voice in 3 sentences to active voice
```

**Why this matters**:
Without consolidation, writer gets 6 potentially conflicting perspectives and doesn't know where to start. With consolidation, writer gets one clear set of prioritized, conflict-resolved action items.
</details>

<details>
<summary><strong>Question 5: How do you design an effective reviewer persona?</strong></summary>

**Answer**:

Apply these **4 design principles**:

**Principle 1: Clear, Focused Mandate**
```
✅ Good: "Check spelling, grammar, punctuation. Report line numbers."
❌ Bad: "Review the article for quality"
```

**Principle 2: Specific Values and Priorities**
```
✅ Good: "Focus on Western civilization, downplay colonialism's negative aspects"
❌ Bad: "Care about traditional values"
```

**Principle 3: Representative of Real Stakeholders**
```
✅ Good: Based on actual parent feedback patterns in school districts
❌ Bad: Stereotype not grounded in real-world concerns
```

**Principle 4: Creates Productive Tension**
```
✅ Good: Conservative wants traditional algorithms, Liberal wants real-world contexts
   (Both valid, forces better solution that balances both)
❌ Bad: Reviewer A wants short content, Reviewer B wants long content
   (Artificial conflict, no underlying values)
```

**Testing your persona**:
- Can you explain WHY this reviewer would object to X?
- Does this reviewer catch issues others would miss?
- Would real stakeholders recognize themselves in this persona?
- Does disagreement with other reviewers lead to better content?
</details>

<details>
<summary><strong>Question 6: What's the performance and cost impact of parallel execution?</strong></summary>

**Answer**:

**Performance Impact**:
```
Sequential: 6 reviewers × 5s each = 30 seconds
Parallel:   6 reviewers, max(5s)  = 5 seconds
Speedup:    30s / 5s = 6x faster ✅
```

**Cost Impact**:
```
Sequential: 6 API calls × 2,000 tokens = 12,000 tokens
Parallel:   6 API calls × 2,000 tokens = 12,000 tokens
Cost difference: $0.00 (same number of API calls) ✅
```

**Key insight**: Parallel execution is **faster** but **not cheaper**. You're making the same API calls, just concurrently instead of sequentially.

**Trade-offs**:

| Aspect | Sequential | Parallel |
|--------|-----------|----------|
| **Speed** | Slow (30s) | Fast (5s) |
| **Cost** | Same | Same |
| **Complexity** | Simple | More complex |
| **Debugging** | Easy (linear) | Harder (concurrent) |
| **Rate limits** | Gentle | May hit limits |
| **User experience** | Poor (long wait) | Good (short wait) |

**Production recommendation**:
- Use **parallel** for independent tasks (Round 1 reviews)
- Use **sequential** for dependent tasks (Round 2 reviews)
- Use **batched parallel** if rate limits are strict
</details>

<details>
<summary><strong>Question 7: Why is the two-round review pattern beneficial?</strong></summary>

**Answer**:

**Round 1: Independent Reviews**
```python
reviews_so_far = []  # Empty - no knowledge of others
for reviewer in panel:
    review = await reviewer.review(article, reviews_so_far)
```

**Benefits**:
- ✅ Prevents groupthink (reviewers not influenced by others)
- ✅ Captures true diverse perspectives
- ✅ Each reviewer focuses on their specialty without bias

**Round 2: Informed Reviews**
```python
reviews_so_far = round1_reviews  # Can see what others said
for reviewer in panel:
    review = await reviewer.review(article, reviews_so_far)
```

**Benefits**:
- ✅ Reviewers can respond to each other's points
- ✅ Conflicts surface explicitly (conservative vs liberal)
- ✅ More nuanced feedback ("I agree with Grammar, but...")
- ✅ Cross-domain insights (math reviewer notices grammar issue)

**Example of Round 2 value**:
```
Round 1:
- MathReviewer: "Equation is correct"
- GrammarReviewer: "No issues"

Round 2 (after seeing each other):
- MathReviewer: "Grammar didn't flag it, but the equation notation,
  while technically correct, may confuse 9th graders.
  Consider simplifying."
```
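
The two rounds can also be combined with parallel execution: reviewers within a round are independent, so each round can use `asyncio.gather()`, while round 2 must wait for round 1 to finish. `EchoReviewer` below is a hypothetical stand-in whose `review(article, reviews_so_far)` shape mirrors the loops above.

```python
import asyncio

class EchoReviewer:
    """Hypothetical reviewer that reports how many prior reviews it saw."""
    def __init__(self, name: str):
        self.name = name

    async def review(self, article: str, reviews_so_far: list[str]) -> str:
        await asyncio.sleep(0.01)  # simulated LLM latency
        return f"{self.name} saw {len(reviews_so_far)} prior reviews"

async def two_round_review(article: str, panel: list) -> tuple[list[str], list[str]]:
    # Round 1: independent, so run concurrently with an empty context.
    round1 = await asyncio.gather(*(r.review(article, []) for r in panel))
    # Round 2 starts only after round 1 finishes; within the round,
    # reviewers see round 1 output but not each other's round 2 output.
    round2 = await asyncio.gather(*(r.review(article, list(round1)) for r in panel))
    return list(round1), list(round2)
```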

**Trade-off**: 2x the API calls, in exchange for feedback in which reviewers can respond to each other's points.
</details>

<details>
<summary><strong>Question 8: How do you handle rate limits in multi-agent systems?</strong></summary>

**Answer**:

**Problem**: Sending 6 parallel requests may exceed API rate limits

**Gemini API limits** (2025):
- Free tier: 15 requests/minute
- Paid tier: 60 requests/minute

**Solution 1: Batched Parallel Execution**
```python
batch_size = 2  # Adjust based on your API tier
batches = [reviewers[i:i+batch_size] for i in range(0, len(reviewers), batch_size)]

all_reviews = []
for i, batch in enumerate(batches):
    tasks = [r.review(article) for r in batch]
    batch_reviews = await asyncio.gather(*tasks, return_exceptions=True)
    all_reviews.extend(batch_reviews)
    if i < len(batches) - 1:
        await asyncio.sleep(1.0)  # Short pause between batches (none after the last)

# With ~5s per call: 3 batches × 5s + 2 × 1s pauses ≈ 17s
# (vs 30s sequential, 5s fully parallel)
```

**Solution 2: Semaphore (Concurrency Control)**
```python
semaphore = asyncio.Semaphore(2)  # At most 2 reviews in flight at once

async def review_with_limit(reviewer, article):
    async with semaphore:
        return await reviewer.review(article)

tasks = [review_with_limit(r, article) for r in reviewers]
reviews = await asyncio.gather(*tasks, return_exceptions=True)
```

**Solution 3: Exponential Backoff (Retry on 429)**
```python
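class RateLimitError(Exception):
    """Placeholder: substitute the HTTP 429 exception raised by your API client."""

async def review_with_retry(reviewer, article, max_retries=3):
    for attempt in range(max_retries):
        try:
            return await reviewer.review(article)
        except RateLimitError:
            if attempt == max_retries - 1:
                raise  # Out of retries: surface the error
            wait_time = 2 ** attempt  # Waits 1s, then 2s (with max_retries=3)
            await asyncio.sleep(wait_time)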
```

**Best practice**: Use batching + backoff together for production systems.
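
To check that a semaphore really caps concurrency, you can count in-flight calls. A sketch with hypothetical sleeping reviewers and a limit of 2 (matching the batch size above):

```python
import asyncio

async def run_limited(n_reviews: int, limit: int) -> tuple[list[str], int]:
    semaphore = asyncio.Semaphore(limit)
    in_flight = 0
    peak = 0

    async def review(i: int) -> str:
        nonlocal in_flight, peak
        async with semaphore:          # at most `limit` coroutines get past here
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(0.02)  # simulated API call
            in_flight -= 1
            return f"review {i}"

    results = await asyncio.gather(*(review(i) for i in range(n_reviews)))
    return list(results), peak
```

Unlike fixed batches, the semaphore starts a new call as soon as any slot frees, so one slow reviewer doesn't hold up the rest of its batch.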
</details>

---

### Related Patterns

The Multi-Agent Collaboration pattern (Pattern 23) works best when combined with:

- **Pattern 18 (Reflection)**: How agents incorporate feedback
  - Writer uses consolidated review to revise article
  - Iterative improvement through multiple review cycles
  - **Code**: `composable_app/agents/generic_writer_agent.py:74-89` (revise_article)

- **Pattern 19 (Dependency Injection)**: ReviewerPanel uses DI
  - AbstractWriter interface allows swapping writer implementations
  - WriterFactory creates writers based on type
  - **Tutorial**: [`architecture_deep_dive.md`](../concepts/architecture_deep_dive.md)

- **Pattern 25 (Prompt Caching)**: Reuse system prompts
  - Each reviewer's system prompt is loaded once and reused
  - Significant cost savings for multi-round reviews
  - **Code**: `composable_app/utils/prompt_service.py`

- **Pattern 32 (Guardrails)**: Input validation before review
  - Validate article meets basic requirements before sending to panel
  - Catch inappropriate content early (before expensive 6-agent review)
  - **Tutorial**: [`llm_as_judge_tutorial.ipynb`](llm_as_judge_tutorial.ipynb)
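
A cheap pre-panel check might look like the sketch below. The rules (a minimum word count and a blocklist of placeholder terms) and the `precheck` name are hypothetical; a real guardrail would apply your own policies, possibly with an LLM judge.

```python
import re

BLOCKLIST = {"lorem", "todo"}  # hypothetical placeholder terms

def precheck(article: str, min_words: int = 50) -> list[str]:
    """Return a list of problems; an empty list means the draft may go to the panel."""
    problems = []
    words = re.findall(r"[a-z']+", article.lower())
    if len(words) < min_words:
        problems.append(f"too short: {len(words)} words (minimum {min_words})")
    hits = sorted(BLOCKLIST.intersection(words))
    if hits:
        problems.append(f"placeholder text found: {', '.join(hits)}")
    return problems
```

Only drafts for which `precheck` returns an empty list would be sent on to the six-reviewer panel, avoiding a full review round on obviously incomplete drafts.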

---

### Book Reference

> **Pattern 23: Multi-Agent Collaboration** is detailed in:
>
> *Generative AI Design Patterns* by Valliappa Lakshmanan and Hannes Hapke (O'Reilly, 2025)
>
> **Chapter**: Multi-Agent Systems and Orchestration (pages TBD)
>
> **Key concepts covered**:
> - Parallel vs. sequential agent orchestration
> - Specialist and adversarial agent design
> - Consolidation and aggregation patterns
> - Performance optimization with async/await
> - Cost-effective multi-agent architectures

**Related chapters**:
- Chapter 6: RAG pattern (retrieval before generation)
- Chapter 8: Reflection pattern (incorporating feedback)
- Chapter 9: Dependency injection (modular agent design)

---

### Next Steps

**To practice multi-agent patterns**:

1. **Modify ReviewerPanel for parallel execution**
   - Exercise: Update `do_first_round_reviews()` to use `asyncio.gather()`
   - File: `composable_app/agents/reviewer_panel.py:100-131`
   - Test: Run performance comparison (should see 5-6x speedup)

2. **Design your own reviewer persona**
   - Exercise: Create a "STEM Equity Advocate" reviewer
   - Requirements: Check for diverse examples, accessible language, low-cost experiments
   - Test: Run on sample article about physics or chemistry

3. **Implement batched execution**
   - Exercise: Add semaphore-based rate limiting to ReviewerPanel
   - Constraint: Max 2 concurrent reviewer calls
   - Measure: Compare timing (should be between sequential and full parallel)

4. **Enhance secretary consolidation**
   - Exercise: Improve secretary's system prompt to better resolve conflicts
   - Test: Create article with known conservative/liberal tension
   - Evaluate: Does secretary provide balanced, actionable guidance?

**To deepen understanding**:

- ✅ **Complete tutorial**: [`reflection_pattern.md`](../concepts/reflection_pattern.md)
  - Learn how writers incorporate consolidated feedback
  
- ✅ **Complete tutorial**: [`architecture_deep_dive.md`](../concepts/architecture_deep_dive.md)
  - Understand dependency injection used in ReviewerPanel

- ✅ **Explore codebase**: Read `composable_app/agents/reviewer_panel.py`
  - Study full implementation with error handling
  - See how logging and evaluation recording work

- ✅ **Real-world application**: Use ReviewerPanel in your own domain
  - Replace curriculum reviewers with your domain experts
  - Design adversarial reviewers for your stakeholders
  - Test on real content from your application

---

### Congratulations! 🎉

You've completed the Multi-Agent Collaboration pattern tutorial. You now understand:

- ✅ Parallel vs. sequential orchestration patterns
- ✅ How to design specialist and adversarial reviewers
- ✅ Secretary consolidation for conflict resolution
- ✅ Performance optimization with `asyncio.gather()`
- ✅ Common pitfalls and production best practices
- ✅ Cost and rate limit considerations

**You're ready to apply these patterns to your own multi-agent systems!**

---

### Feedback and Questions

Have questions or suggestions? Please:
- 📝 Open an issue: [github.com/composable-app/issues](https://github.com/)
- 💬 Join discussions: [github.com/composable-app/discussions](https://github.com/)
- 📧 Email the authors: [provided in book]

**Tutorial changelog**: See [`TUTORIAL_CHANGELOG.md`](../../TUTORIAL_CHANGELOG.md) for updates when code changes.

---

## Part 7: Common Pitfalls and How to Avoid Them

When implementing multi-agent systems with parallel execution, several common pitfalls can cause issues. Here's how to recognize and avoid them.

### Pitfall 1: Not Using `return_exceptions=True` in asyncio.gather()

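**Problem**: By default (`return_exceptions=False`), `asyncio.gather()` immediately propagates the first exception to the caller. The other awaitables are not cancelled and keep running, but their results are never returned to you.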

```python
# ❌ Bad: One failure loses every result
reviews = await asyncio.gather(
    reviewer1.review(article),
    reviewer2.review(article),  # If this fails...
    reviewer3.review(article),  # ...this may still finish, but you never see its result
)
```

**Consequence**: If one reviewer fails (API timeout, rate limit, bug), `gather()` raises instead of returning, so you lose ALL reviews, including the ones that succeeded.

**Solution**: Use `return_exceptions=True` to get partial results:

```python
# ✅ Good: Continue despite failures
import asyncio
import logging

logger = logging.getLogger(__name__)

reviews = await asyncio.gather(
    reviewer1.review(article),
    reviewer2.review(article),  # If this fails...
    reviewer3.review(article),  # ...you still get this result!
    return_exceptions=True
)

# Separate successes from failures (exceptions are returned in place, in input order)
successful_reviews = [r for r in reviews if not isinstance(r, BaseException)]
failed_reviews = [r for r in reviews if isinstance(r, BaseException)]

# Log failures for debugging
for exc in failed_reviews:
    logger.error(f"Reviewer failed: {exc}")
```
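To see the difference end to end, here is a small self-contained demo. `mock_review()` is a hypothetical stand-in for `ReviewerAgent.review()` that can be told to fail; the demo runs the same three calls once with the default `gather()` and once with `return_exceptions=True`.

```python
import asyncio


async def mock_review(name: str, fail: bool = False) -> str:
    """Hypothetical stand-in for ReviewerAgent.review()."""
    await asyncio.sleep(0.01)
    if fail:
        raise TimeoutError(f"{name} timed out")
    return f"{name}: looks fine"


async def demo() -> tuple[str, list[str], list[BaseException]]:
    # Default behaviour: the first exception propagates and no results come back.
    try:
        await asyncio.gather(
            mock_review("Grammar"),
            mock_review("Math", fail=True),
            mock_review("District"),
        )
        default_outcome = "no error"
    except TimeoutError as exc:
        default_outcome = f"gather raised: {exc}"

    # return_exceptions=True: exceptions are returned in place, in input order.
    results = await asyncio.gather(
        mock_review("Grammar"),
        mock_review("Math", fail=True),
        mock_review("District"),
        return_exceptions=True,
    )
    ok = [r for r in results if not isinstance(r, BaseException)]
    failed = [r for r in results if isinstance(r, BaseException)]
    return default_outcome, ok, failed


# In a notebook with a running event loop, use `await demo()` instead.
default_outcome, ok, failed = asyncio.run(demo())
print(default_outcome)
print(ok)
print(failed)
```

The first call gives you nothing but the exception; the second keeps both successful reviews and hands you the failure to log.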

**Code reference**: See demo in cell above comparing with/without `return_exceptions=True`

---

### Pitfall 2: Poor Reviewer Persona Design

**Problem**: Vague or overlapping reviewer mandates lead to redundant or low-quality feedback.

**Bad example**:
```python
# ❌ Vague, overlapping roles
reviewer_prompts = {
    "Reviewer1": "Review the article for quality",
    "Reviewer2": "Check if the article is good",
    "Reviewer3": "Evaluate the content",
}
```

**Result**: All three reviewers give similar generic feedback like "looks good" or "needs improvement."

**Solution**: Give each reviewer a **specific, focused mandate**:

```python
# ✅ Clear, non-overlapping roles
reviewer_prompts = {
    "GrammarReviewer": "Check spelling, grammar, punctuation. Report specific errors with line numbers.",
    "MathReviewer": "Verify all equations, calculations, and mathematical notation for accuracy.",
    "DistrictRep": "Ensure alignment with curriculum standards and appropriate length for printing budget.",
}
```
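One way to keep mandates from drifting into each other is to write each persona down with explicit focus areas and check every pair for overlap before running the panel. `ReviewerPersona` and `overlapping_pairs()` are hypothetical helpers, not part of the ComposableApp; the LLM still only sees the mandate text.

```python
from dataclasses import dataclass, field
from itertools import combinations


@dataclass(frozen=True)
class ReviewerPersona:
    """Hypothetical persona spec: a name, a one-line mandate, and focus areas."""
    name: str
    mandate: str
    focus_areas: frozenset[str] = field(default_factory=frozenset)


def overlapping_pairs(personas: list[ReviewerPersona]) -> list[tuple[str, str, set[str]]]:
    """Return every pair of personas whose focus areas intersect."""
    overlaps = []
    for a, b in combinations(personas, 2):
        shared = a.focus_areas & b.focus_areas
        if shared:
            overlaps.append((a.name, b.name, set(shared)))
    return overlaps


panel = [
    ReviewerPersona("GrammarReviewer", "Check spelling, grammar, punctuation.",
                    frozenset({"spelling", "grammar", "punctuation"})),
    ReviewerPersona("MathReviewer", "Verify equations and notation.",
                    frozenset({"equations", "notation"})),
    ReviewerPersona("DistrictRep", "Check curriculum alignment and length.",
                    frozenset({"curriculum", "length"})),
]
# Adding a vague "quality" reviewer overlaps with two existing mandates.
vague = panel + [ReviewerPersona("QualityReviewer", "Review the article for quality.",
                                 frozenset({"grammar", "curriculum"}))]

print(overlapping_pairs(panel))
print(overlapping_pairs(vague))
```

An empty result for the focused panel and two hits for the vague reviewer is the signal that the vague mandate adds redundancy rather than a new perspective.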

**Design principles** (from Part 4):
1. ✅ Clear, focused mandate (not "check quality")
2. ✅ Specific values and priorities (what does this reviewer care about?)
3. ✅ Representative of real stakeholders (not stereotypes)
4. ✅ Creates productive tension with other reviewers (not artificial conflicts)

---

### Pitfall 3: Ignoring Rate Limits

**Problem**: Sending too many parallel requests triggers API rate limits.

**Bad example**:
```python
# ❌ Sends 100 requests instantly
reviewers = [ReviewerAgent(f"Reviewer{i}") for i in range(100)]
reviews = await asyncio.gather(*[r.review(article) for r in reviewers])
# Result: API returns 429 Too Many Requests, many reviews fail
```

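**Gemini API limits** (illustrative only; limits vary by model and tier and change over time, so check the current Gemini API rate-limits documentation):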
- Free tier: 15 requests/minute
- Paid tier: 60 requests/minute

**Solution 1: Batched parallel execution** (shown in Part 6):
```python
# ✅ Process in batches to respect rate limits
batch_size = 10  # Adjust based on your API tier
batches = [reviewers[i:i+batch_size] for i in range(0, len(reviewers), batch_size)]

all_reviews = []
for batch in batches:
    tasks = [r.review(article) for r in batch]
    batch_reviews = await asyncio.gather(*tasks, return_exceptions=True)
    all_reviews.extend(batch_reviews)
    await asyncio.sleep(1.0)  # Small delay between batches
```

**Solution 2: Use semaphore for concurrency control**:
```python
# ✅ Limit concurrent requests with semaphore
semaphore = asyncio.Semaphore(10)  # Max 10 concurrent requests

async def review_with_limit(reviewer, article):
    async with semaphore:  # Acquired here; released automatically on exit, even on error
        return await reviewer.review(article)

tasks = [review_with_limit(r, article) for r in reviewers]
reviews = await asyncio.gather(*tasks, return_exceptions=True)
```
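Both solutions above cap how many requests are *in flight*; neither strictly guarantees a requests-per-minute limit, since short calls can finish and be replaced quickly. If your quota is expressed per minute, a sliding-window limiter is one option. This is a minimal sketch: `SlidingWindowRateLimiter` is a hypothetical helper, and the demo uses a 0.2-second window so it runs quickly (for a 15 requests/minute quota you would use `max_calls=15, period=60.0`).

```python
import asyncio
import time
from collections import deque


class SlidingWindowRateLimiter:
    """Allow at most `max_calls` acquisitions in any rolling `period` seconds."""

    def __init__(self, max_calls: int, period: float) -> None:
        self.max_calls = max_calls
        self.period = period
        self._timestamps: deque[float] = deque()
        self._lock = asyncio.Lock()  # waiters queue here, one at a time

    async def acquire(self) -> None:
        async with self._lock:
            while True:
                now = time.monotonic()
                # Drop timestamps that have left the window.
                while self._timestamps and now - self._timestamps[0] >= self.period:
                    self._timestamps.popleft()
                if len(self._timestamps) < self.max_calls:
                    self._timestamps.append(now)
                    return
                # Sleep until the oldest call leaves the window, then re-check.
                await asyncio.sleep(self.period - (now - self._timestamps[0]))


async def review_with_rate_limit(limiter: SlidingWindowRateLimiter) -> float:
    await limiter.acquire()
    return time.monotonic()  # stand-in for the actual API call


async def main() -> list[float]:
    limiter = SlidingWindowRateLimiter(max_calls=3, period=0.2)
    start = time.monotonic()
    stamps = await asyncio.gather(*[review_with_rate_limit(limiter) for _ in range(7)])
    return [s - start for s in stamps]


# In a notebook with a running event loop, use `await main()` instead.
offsets = asyncio.run(main())
print([round(t, 2) for t in offsets])
```

With 7 requests and 3 allowed per window, the calls go out in three waves roughly one window apart. You can combine this with a semaphore if you also need to cap concurrency.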

---

### Pitfall 4: Secretary Doesn't Resolve Conflicts

**Problem**: Secretary just concatenates feedback without prioritization or conflict resolution.

**Bad consolidation**:
```
ConservativeParent: Remove paragraph 3
LiberalParent: Expand paragraph 3

Secretary consolidation:
- Remove paragraph 3 (ConservativeParent)
- Expand paragraph 3 (LiberalParent)

Writer: "These contradict each other! What should I do?" 😕
```

**Solution**: Secretary must **explicitly resolve conflicts** with balanced guidance:

**Good consolidation**:
```
REQUIRES DISCUSSION: Paragraph 3 feedback (conflicting)

Conservative parent concern: Content may be controversial
Liberal parent concern: Context is educationally important

RECOMMENDATION: Keep paragraph but revise for balance:
1. Use neutral, factual language (not judgmental)
2. Present multiple perspectives (not just one side)
3. Keep length moderate (addresses both concerns)

This approach maintains educational value while respecting diverse community values.
```

**Implementation tip**: In secretary's system prompt, explicitly instruct it to:
- Identify conflicts: "When reviewers disagree, label it REQUIRES DISCUSSION"
- Analyze both perspectives: "Explain each reviewer's concern"
- Propose balanced solutions: "Recommend approach that addresses both concerns"
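In the ComposableApp the secretary is an LLM reading free-text reviews, so conflict detection happens inside the prompt. If you have reviewers emit structured items instead, a simple pre-check can flag conflicts before the secretary runs and label them for it. This is a minimal sketch under that assumption; `Feedback` and the `OPPOSING` action pairs are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Feedback:
    reviewer: str
    target: str   # e.g. "paragraph 3"
    action: str   # e.g. "remove", "expand", "shorten"


# Hypothetical table of action pairs that pull in opposite directions.
OPPOSING = {frozenset({"remove", "expand"}), frozenset({"shorten", "expand"})}


def find_conflicts(items: list[Feedback]) -> dict[str, list[Feedback]]:
    """Group feedback by target and keep targets that received opposing actions."""
    by_target: dict[str, list[Feedback]] = defaultdict(list)
    for item in items:
        by_target[item.target].append(item)
    conflicts = {}
    for target, group in by_target.items():
        actions = {f.action for f in group}
        if any(pair <= actions for pair in OPPOSING):
            conflicts[target] = group
    return conflicts


feedback = [
    Feedback("ConservativeParent", "paragraph 3", "remove"),
    Feedback("LiberalParent", "paragraph 3", "expand"),
    Feedback("GrammarReviewer", "paragraph 2", "fix spelling"),
]
for target, group in find_conflicts(feedback).items():
    print(f"REQUIRES DISCUSSION: {target}")
    for f in group:
        print(f"  {f.reviewer}: {f.action}")
```

The flagged targets can be passed to the secretary's prompt so it knows exactly which items need a balanced recommendation.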

---

### Pitfall 5: No Logging or Observability

**Problem**: When something goes wrong, you have no visibility into what happened.

**Bad example**:
```python
# ❌ No logging
reviews = await asyncio.gather(*review_tasks)
consolidated = await secretary.consolidate(reviews)
return consolidated  # If something fails, no way to debug
```

**Solution**: Log all agent interactions for debugging and evaluation:

```python
# ✅ Comprehensive logging
import logging
from composable_app.utils import save_for_eval as evals

logger = logging.getLogger(__name__)

# Log individual reviews
for reviewer, review in reviews:
    await evals.record_ai_response(
        f"{reviewer.name}_review",
        ai_input={"article": article, "topic": topic},
        ai_response=review
    )
    logger.info(f"{reviewer.name} completed review ({len(review)} chars)")

# Log consolidation
await evals.record_ai_response(
    "consolidated_review",
    ai_input={"reviews": reviews},
    ai_response=consolidated
)

# Log failures
for exc in failed_reviews:
    logger.error(f"Review failed: {type(exc).__name__}: {exc}")
```
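The snippet above records what each agent said. It can also help to capture *how long* each call took and *which* calls failed. One approach is to wrap every agent call in a small helper. `logged_call()` and `call_log` are hypothetical names; in a real system you might send these records to your logging or evaluation store instead of an in-memory list.

```python
import asyncio
import logging
import time
from typing import Awaitable, Callable

logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s")
logger = logging.getLogger("reviewer_panel")

call_log: list[dict] = []  # hypothetical in-memory record of every agent call


async def logged_call(name: str, make_call: Callable[[], Awaitable[str]]) -> str:
    """Run one agent call, recording duration on success and exception type on failure."""
    start = time.perf_counter()
    try:
        result = await make_call()
    except Exception as exc:
        elapsed = time.perf_counter() - start
        call_log.append({"agent": name, "ok": False, "seconds": elapsed, "error": type(exc).__name__})
        logger.error("%s failed after %.2fs: %s: %s", name, elapsed, type(exc).__name__, exc)
        raise  # re-raise so gather(return_exceptions=True) still sees the failure
    elapsed = time.perf_counter() - start
    call_log.append({"agent": name, "ok": True, "seconds": elapsed, "chars": len(result)})
    logger.info("%s completed in %.2fs (%d chars)", name, elapsed, len(result))
    return result


async def mock_review(name: str, fail: bool = False) -> str:
    await asyncio.sleep(0.01)
    if fail:
        raise ValueError("malformed response")
    return f"{name}: mock feedback"


async def main() -> list:
    return await asyncio.gather(
        logged_call("Grammar", lambda: mock_review("Grammar")),
        logged_call("Math", lambda: mock_review("Math", fail=True)),
        return_exceptions=True,
    )


# In a notebook with a running event loop, use `await main()` instead.
results = asyncio.run(main())
print(call_log)
```

Because the wrapper re-raises, it composes with `return_exceptions=True`: the caller still gets partial results, and the log shows which reviewer failed and why.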

**Benefits**:
- 📊 Track which reviewers fail most often
- 🐛 Debug why secretary's consolidation is poor
- 📈 Measure review quality over time
- 💰 Monitor token usage and costs

**Code reference**: `composable_app/agents/reviewer_panel.py:56` shows logging in `ReviewerAgent.review()`

---

### Pitfall 6: Sequential Execution When Parallel Would Work

**Problem**: Using sequential execution for independent tasks wastes time.

**Bad example**:
```python
# ❌ Sequential when agents are independent (Round 1)
reviews = []
for reviewer in review_panel:
    review = await reviewer.review(article, reviews_so_far=[])
    reviews.append(review)
# Takes: 6 × 5s = 30 seconds
```

**When to use sequential**:
- ✅ Tasks are **dependent** (each Round 2 review builds on the reviews produced before it)
- ✅ Debugging (linear execution is easier to trace)
- ✅ Strict rate limits (can't handle concurrent requests)

**When to use parallel**:
- ✅ Tasks are **independent** (Round 1 reviews don't need each other)
- ✅ I/O-bound operations (LLM API calls, database queries)
- ✅ Production environment with adequate rate limits

**Solution**: Use parallel for Round 1, sequential for Round 2:

```python
# ✅ Round 1: Parallel (independent reviews)
review_tasks = [r.review(article, reviews_so_far=[]) for r in review_panel]
round1_reviews = await asyncio.gather(*review_tasks, return_exceptions=True)
round1_reviews = [r for r in round1_reviews if not isinstance(r, BaseException)]
# Takes: ~5 seconds, the slowest single review (up to 6x faster!)

# ✅ Round 2: Sequential (each reviewer sees Round 1 plus earlier Round 2 reviews)
round2_reviews = []
for reviewer in review_panel:
    review = await reviewer.review(article, reviews_so_far=round1_reviews + round2_reviews)
    round2_reviews.append(review)
# Takes: 6 × 5s = 30 seconds (the price of letting each review build on the previous ones)
```
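The mock below makes the Round 2 dependency concrete: each Round 2 reviewer is shown the Round 1 reviews plus every Round 2 review written before it, which is what rules out running Round 2 concurrently. `mock_review()` is a hypothetical stand-in that only reports how many prior reviews it received; reviews are kept as `(name, text)` pairs, mirroring the `(Reviewer, review_text)` tuples the secretary consumes.

```python
import asyncio


async def mock_review(name: str, reviews_so_far: list[tuple[str, str]]) -> str:
    """Hypothetical reviewer: reports how many prior reviews it was shown."""
    await asyncio.sleep(0.01)
    return f"{name} (saw {len(reviews_so_far)} prior reviews)"


async def two_round_review(panel: list[str]) -> tuple[list[tuple[str, str]], list[tuple[str, str]]]:
    # Round 1: independent, so run concurrently.
    round1_texts = await asyncio.gather(*[mock_review(n, []) for n in panel])
    round1 = list(zip(panel, round1_texts))

    # Round 2: each reviewer sees Round 1 plus the Round 2 reviews before it.
    round2: list[tuple[str, str]] = []
    for name in panel:
        text = await mock_review(name, round1 + round2)
        round2.append((name, text))
    return round1, round2


# In a notebook with a running event loop, use `await two_round_review(...)` instead.
round1, round2 = asyncio.run(two_round_review(["Grammar", "Math", "District"]))
for _, text in round2:
    print(text)
```

If your Round 2 reviewers only need Round 1, the accumulation disappears and Round 2 can be gathered just like Round 1.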

---

### Pitfall 7: Forgetting Cost Implications

**Problem**: Parallel execution makes a run faster, not cheaper. You make the same API calls either way, which can surprise users who expect speed to bring savings.

**Misconception**:
```
"Parallel execution is 6x faster, so it must be cheaper!"
```

**Reality**:
- Sequential: 6 reviewers √ó 2,000 tokens = 12,000 tokens
- Parallel: 6 reviewers √ó 2,000 tokens = 12,000 tokens
- **Cost is the same!** You're making the same API calls, just faster.

**Solution**: Always display cost warnings in notebooks and UIs:

```python
# ✅ Clear cost warning
print("""
⚠️  COST ESTIMATE: ~$0.0015 per run

This operation will make 6 parallel reviewer calls plus 1 secretary consolidation:
- 6 reviewer calls: ~9,000 input tokens
- 1 secretary call: ~3,000 input tokens
- Total output: ~2,000 tokens

Estimated cost: (12,000 × $0.075 + 2,000 × $0.30) / 1M ≈ $0.0015 per run

Press Enter to continue or Ctrl+C to cancel...
""")
input()
```

**Cost tracking**:
```python
# Track token usage
# Note: the `usage.input_tokens` / `usage.output_tokens` names are illustrative;
# adapt them to whatever your LLM client returns for each review result.
total_input_tokens = sum(r.usage.input_tokens for r in review_results)
total_output_tokens = sum(r.usage.output_tokens for r in review_results)

print(f"💰 Total tokens used: {total_input_tokens} input + {total_output_tokens} output")
print(f"💵 Estimated cost: ${(total_input_tokens * 0.075 + total_output_tokens * 0.30) / 1_000_000:.4f}")
```
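A small helper keeps the cost arithmetic in one place. The default prices are the Gemini 2.0 Flash figures quoted at the top of this tutorial; treat them as assumptions and update them from the current pricing page. The token counts are the per-run estimates used in this section, and they are the same whether the calls run sequentially or in parallel.

```python
def estimate_cost_usd(input_tokens: int, output_tokens: int,
                      input_price_per_m: float = 0.075,
                      output_price_per_m: float = 0.30) -> float:
    """Token-based cost estimate; prices are USD per 1M tokens (assumed defaults)."""
    return (input_tokens * input_price_per_m + output_tokens * output_price_per_m) / 1_000_000


# 6 reviewer calls (~9,000 input) + 1 secretary call (~3,000 input), ~2,000 output in total
per_run = estimate_cost_usd(9_000 + 3_000, 2_000)
print(f"Per run: ${per_run:.4f}")       # Per run: $0.0015
print(f"10 runs: ${10 * per_run:.4f}")  # 10 runs: $0.0150
```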

---

### Pitfall 8: Not Testing with Mocks First

**Problem**: Developing multi-agent systems directly against live APIs is slow and expensive.

**Bad workflow**:
```
1. Write code for 6-agent system
2. Run against live API
3. Wait 30 seconds per test
4. Discover bug
5. Fix bug
6. Repeat (costs add up quickly)
```

**Solution**: Use mocks for development, real APIs for final testing:

```python
# ✅ Development: Use mocks
async def mock_review(reviewer_name: str) -> str:
    await asyncio.sleep(0.1)  # Fast
    return f"{reviewer_name}: Mock review feedback"

# Test your orchestration logic
reviews = await asyncio.gather(*[mock_review(r) for r in reviewers])
consolidated = mock_secretary_consolidate(reviews)

# ✅ Production: Use real APIs
from composable_app.agents.reviewer_panel import ReviewerAgent
reviewer = ReviewerAgent(Reviewer.GRAMMAR_REVIEWER)
review = await reviewer.review(topic, article, [])
```

**Benefits**:
- 🚀 Fast iteration (0.1s vs 5s per test)
- 💰 Zero API costs during development
- 🧪 Easy to test edge cases (simulate failures, timeouts)
- 🔧 Focus on logic, not API quirks

**Code reference**: See Part 6 performance profiling cells using `realistic_mock_review()`
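Mocks are most useful when they can misbehave. This sketch extends the idea with a hypothetical `flaky_mock_review()` that fails at a configurable rate, wrapped in `asyncio.wait_for()` so slow calls surface as timeouts. A seeded `random.Random` keeps the failures reproducible between runs.

```python
import asyncio
import random


async def flaky_mock_review(name: str, *, failure_rate: float, delay: float,
                            rng: random.Random) -> str:
    """Hypothetical mock that sometimes fails the way a real API call might."""
    await asyncio.sleep(delay)
    if rng.random() < failure_rate:
        raise ConnectionError(f"{name}: simulated 429 / network error")
    return f"{name}: mock feedback"


async def run_panel(names: list[str], *, timeout: float, failure_rate: float,
                    delays: list[float], seed: int = 0) -> list:
    rng = random.Random(seed)
    tasks = [
        # wait_for turns a call slower than `timeout` into a TimeoutError
        asyncio.wait_for(flaky_mock_review(n, failure_rate=failure_rate, delay=d, rng=rng),
                         timeout=timeout)
        for n, d in zip(names, delays)
    ]
    return await asyncio.gather(*tasks, return_exceptions=True)


names = ["Grammar", "Math", "District"]
# In a notebook with a running event loop, use `await run_panel(...)` instead.
results = asyncio.run(run_panel(names, timeout=0.05, failure_rate=0.0, delays=[0.01, 0.2, 0.01]))
for name, r in zip(names, results):
    print(name, "->", type(r).__name__ if isinstance(r, BaseException) else r)

# Edge case: every call fails
all_fail = asyncio.run(run_panel(names, timeout=0.05, failure_rate=1.0, delays=[0.01, 0.01, 0.01]))
print([type(r).__name__ for r in all_fail])
```

Running the "every call fails" case is a cheap way to check that your secretary step and error handling behave sensibly when no reviews arrive at all.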

---

### Checklist: Multi-Agent System Health

Before deploying your multi-agent system to production, verify:

- [ ] ✅ Using `return_exceptions=True` in `asyncio.gather()`
- [ ] ✅ Each reviewer has clear, specific mandate
- [ ] ✅ Rate limiting handled with batching or semaphores
- [ ] ✅ Secretary explicitly resolves conflicting feedback
- [ ] ✅ All agent interactions logged for debugging
- [ ] ✅ Parallel execution for independent tasks, sequential for dependent
- [ ] ✅ Cost warnings displayed to users
- [ ] ✅ Development workflow uses mocks, production uses real APIs
- [ ] ✅ Error handling covers API timeouts, rate limits, malformed responses
- [ ] ✅ Performance profiled (sequential vs parallel timing)

---

### Summary: Common Pitfalls

| Pitfall | Impact | Solution |
|---------|--------|----------|
| No `return_exceptions=True` | One failure loses all reviews | Add `return_exceptions=True` to gather() |
| Vague reviewer personas | Redundant, low-quality feedback | Clear, focused mandates for each reviewer |
| Ignoring rate limits | API errors, failed reviews | Batched execution or semaphore control |
| No conflict resolution | Confusing, contradictory guidance | Secretary must explicitly resolve conflicts |
| No logging | Can't debug failures | Log all agent interactions with evals |
| Sequential when parallel works | 6x slower than necessary | Use parallel for independent tasks |
| Forgetting costs | Budget surprises | Display cost warnings, track token usage |
| Testing against live API | Slow, expensive development | Use mocks for dev, real APIs for final test |

**Next**: Self-assessment questions to test your understanding!

---

```python
# Strategy 3: Batched parallel execution
print("STRATEGY 3: BATCHED PARALLEL (Rate Limit Management)")
print("=" * 70)
print("Running reviewers in batches of 2...\n")

start_time = time.time()
reviews_batched = []

# Split reviewers into batches of 2
batch_size = 2
batches = [reviewers[i:i+batch_size] for i in range(0, len(reviewers), batch_size)]

print(f"Total batches: {len(batches)}")
print(f"Batch size: {batch_size}\n")

for batch_num, batch in enumerate(batches, 1):
    print(f"Batch {batch_num}/{len(batches)}: {batch}")

    # Run this batch in parallel
    tasks = [realistic_mock_review(reviewer) for reviewer in batch]
    batch_reviews = await asyncio.gather(*tasks)
    reviews_batched.extend(batch_reviews)

    elapsed = time.time() - start_time
    print(f"  Batch completed. Elapsed total: {elapsed:.2f}s\n")

total_time_batched = time.time() - start_time

# Calculate metrics
metrics_batched = PerformanceMetrics(
    strategy="Batched Parallel (batch=2)",
    total_time=total_time_batched,
    num_reviews=len(reviews_batched),
    speedup=total_time_seq / total_time_batched,
    throughput=len(reviews_batched) / total_time_batched
)

print("=" * 70)
print(metrics_batched)
# Each batch waits for its slower reviewer: two 2-reviewer batches (~4.0s each) + one 1-reviewer batch (~3.5s)
print("\n⏱️  Expected: ~11.5 seconds (2 × ~4.0s + 1 × ~3.5s)")
print(f"⏱️  Actual: {total_time_batched:.2f} seconds")
print(f"\n🚀 SPEEDUP: {metrics_batched.speedup:.1f}x faster than sequential")
print("⚖️  TRADE-OFF: Slower than full parallel, but caps concurrent requests to help stay within rate limits")
```

### Strategy 3: Batched Parallel (Rate Limit Management)

In production, you may hit API rate limits if you send too many concurrent requests. A **batched approach** balances speed and rate limits.

```python
# Strategy 2: Parallel execution with asyncio.gather()
print("STRATEGY 2: PARALLEL EXECUTION (asyncio.gather)")
print("=" * 70)
print("Running all reviewers simultaneously...\n")

start_time = time.time()

# Create one coroutine per reviewer (gather schedules them as tasks)
tasks = [realistic_mock_review(reviewer) for reviewer in reviewers]

print(f"Starting {len(tasks)} reviewers in parallel...")

# Execute all in parallel
reviews_parallel = await asyncio.gather(*tasks)

total_time_parallel = time.time() - start_time

# Calculate metrics
metrics_parallel = PerformanceMetrics(
    strategy="Parallel (asyncio.gather)",
    total_time=total_time_parallel,
    num_reviews=len(reviews_parallel),
    speedup=total_time_seq / total_time_parallel,
    throughput=len(reviews_parallel) / total_time_parallel
)

print("All reviewers completed!\n")
print("=" * 70)
print(metrics_parallel)
print("\n⏱️  Expected: ~4.5 seconds on average (the slowest of 5 reviewers, never more than 5.0s)")
print(f"⏱️  Actual: {total_time_parallel:.2f} seconds")
print(f"\n🚀 SPEEDUP: {metrics_parallel.speedup:.1f}x faster than sequential!")
```

### Strategy 2: Parallel Execution with asyncio.gather()

```python
# Strategy 1: Sequential execution
reviewers = ["Grammar", "Math", "District", "ConservativeParent", "LiberalParent"]

print("STRATEGY 1: SEQUENTIAL EXECUTION")
print("=" * 70)
print("Running reviewers one at a time...\n")

start_time = time.time()
reviews_sequential = []

for i, reviewer in enumerate(reviewers, 1):
    print(f"[{i}/{len(reviewers)}] Starting {reviewer}...")
    review = await realistic_mock_review(reviewer)
    reviews_sequential.append(review)
    elapsed = time.time() - start_time
    print(f"    Completed. Elapsed total: {elapsed:.2f}s\n")

total_time_seq = time.time() - start_time

# Calculate metrics
metrics_sequential = PerformanceMetrics(
    strategy="Sequential",
    total_time=total_time_seq,
    num_reviews=len(reviews_sequential),
    speedup=1.0,  # Baseline
    throughput=len(reviews_sequential) / total_time_seq
)

print("=" * 70)
print(metrics_sequential)
print("\n⏱️  Expected: ~17.5 seconds (5 reviewers × 3.5s average)")
print(f"⏱️  Actual: {total_time_seq:.2f} seconds")
```

### Strategy 1: Sequential Execution (Baseline)

```python
# Performance testing configuration
import asyncio
import random
import time
from dataclasses import dataclass

@dataclass
class PerformanceMetrics:
    """Track performance metrics for comparison."""
    strategy: str
    total_time: float
    num_reviews: int
    speedup: float
    throughput: float  # reviews per second

    def __str__(self):
        return f"""
Strategy: {self.strategy}
  Total time:    {self.total_time:.2f}s
  Reviews:       {self.num_reviews}
  Speedup:       {self.speedup:.2f}x
  Throughput:    {self.throughput:.2f} reviews/sec
"""

# Mock reviewer with realistic API latency
async def realistic_mock_review(
    reviewer_name: str,
    min_delay: float = 2.0,
    max_delay: float = 5.0
) -> str:
    """
    Simulate a reviewer with variable API latency.

    Real LLM APIs have variable response times depending on:
    - Model load
    - Prompt complexity
    - Output length
    - Network conditions
    """
    delay = random.uniform(min_delay, max_delay)
    await asyncio.sleep(delay)
    return f"{reviewer_name} review completed (took {delay:.2f}s)"

print("✅ Performance testing setup complete")
print("\nRealistic delays configured:")
print("  - Minimum: 2.0 seconds (fast response)")
print("  - Maximum: 5.0 seconds (slow response)")
print("  - Average: ~3.5 seconds (typical)")
print("\nThis approximates real API latency without API costs.")
```

## Part 6: Performance Profiling - Sequential vs. Parallel

Now let's dive deep into **performance analysis** to quantify the benefits of parallel execution.

### What We'll Measure

We'll compare three execution strategies:

1. **Sequential (baseline)** - One reviewer at a time
2. **Parallel with asyncio.gather()** - All reviewers simultaneously
3. **Batched parallel** - Groups of reviewers in parallel (for rate limit management)

For each strategy, we'll measure:
- ⏱️ **Total execution time**
- 📊 **Speedup factor** (vs. sequential)
- 💰 **API cost** (the same for every strategy, since each makes the same calls; the mocks below cost nothing)
- 🎯 **Throughput** (reviews per second)

---

### Performance Testing Setup

We'll use mock reviewers with realistic delays to simulate API latency without actual API costs.
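Before running the cells, it helps to know what to expect. With delays drawn independently from Uniform(2.0, 5.0), as in `realistic_mock_review()`, sequential time is the sum of the means, but a parallel run (or a batch) waits for its *slowest* reviewer. The expected maximum of n such delays is a + (b − a)·n/(n + 1), so running 5 reviewers in parallel should average about 4.5 s rather than 3.5 s, for a speedup of roughly 17.5 / 4.5 ≈ 3.9x. The helper names below are hypothetical, for illustration only.

```python
def expected_max_uniform(n: int, lo: float, hi: float) -> float:
    """Expected maximum of n independent Uniform(lo, hi) delays."""
    return lo + (hi - lo) * n / (n + 1)


def expected_times(num_reviewers: int, lo: float = 2.0, hi: float = 5.0,
                   batch_size: int = 2) -> dict[str, float]:
    """Expected wall-clock time for each strategy under the mock's delay model."""
    mean = (lo + hi) / 2
    full_batches, remainder = divmod(num_reviewers, batch_size)
    batched = full_batches * expected_max_uniform(batch_size, lo, hi)
    if remainder:
        batched += expected_max_uniform(remainder, lo, hi)
    return {
        "sequential": num_reviewers * mean,
        "parallel": expected_max_uniform(num_reviewers, lo, hi),
        "batched": batched,
    }


for strategy, seconds in expected_times(5).items():
    print(f"{strategy:>10}: ~{seconds:.1f}s")
```

Individual runs will scatter around these averages, so compare several runs rather than reading too much into one.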

### Key Consolidation Strategies

The secretary uses several strategies to create useful consolidated feedback:

#### Strategy 1: Prioritization by Impact

```
CRITICAL > HIGH PRIORITY > MODERATE > OPTIONAL
```

**Criteria for CRITICAL**:
- ❌ Factual errors that mislead students
- ❌ Safety issues
- ❌ Legal/compliance violations
- ❌ Spelling/grammar in key concepts

**Criteria for HIGH PRIORITY**:
- ⚠️ Clarity issues that confuse students
- ⚠️ Missing curriculum alignments
- ⚠️ Inefficient structure

**Criteria for OPTIONAL**:
- 💡 Enhancement suggestions
- 💡 Alternative approaches
- 💡 Future improvements
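In the ComposableApp the LLM decides priorities as it writes its summary. If you ask the secretary for structured items instead (an assumption for illustration), ordering them is a one-line sort once the levels are an `IntEnum`. `Priority` and `Item` are hypothetical names.

```python
from dataclasses import dataclass
from enum import IntEnum


class Priority(IntEnum):
    # Lower value = more urgent, so an ascending sort puts CRITICAL first.
    CRITICAL = 0
    HIGH = 1
    MODERATE = 2
    OPTIONAL = 3


@dataclass
class Item:
    priority: Priority
    reviewer: str
    text: str


items = [
    Item(Priority.OPTIONAL, "LiberalParent", "Add climate-change connection"),
    Item(Priority.CRITICAL, "GrammarReviewer", "Fix 3 spelling errors"),
    Item(Priority.HIGH, "DistrictRep", "Clarify paragraph 2"),
]
ordered = sorted(items, key=lambda item: item.priority)
for item in ordered:
    print(f"{item.priority.name}: {item.text} ({item.reviewer})")
```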

#### Strategy 2: Synthesis of Related Feedback

**Before consolidation**:
```
GrammarReviewer: "Sentence structure in paragraph 2 is awkward."
DistrictRep: "Paragraph 2 is hard to understand."
LiberalParent: "Paragraph 2 uses jargon that ELL students might struggle with."
```

**After consolidation**:
```
HIGH PRIORITY: Revise paragraph 2 for clarity
- Multiple reviewers noted comprehension issues
- Specific problems: awkward sentence structure, jargon usage
- Recommendation: Simplify sentences and define technical terms
- Benefits: Improves understanding for all students, especially ELL learners
```
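Synthesis starts with noticing that several comments point at the same place. Assuming each comment is tagged with a location (a hypothetical convention; the app's reviewers write free text), grouping by location surfaces candidates for a single merged item like the one above.

```python
from collections import defaultdict

raw = [
    ("GrammarReviewer", "paragraph 2", "Sentence structure is awkward."),
    ("DistrictRep", "paragraph 2", "Hard to understand."),
    ("LiberalParent", "paragraph 2", "Uses jargon that ELL students might struggle with."),
    ("MathReviewer", "equation 1", "Use × instead of *."),
]


def group_by_location(feedback: list[tuple[str, str, str]]) -> dict[str, list[tuple[str, str]]]:
    """Map each location to the (reviewer, comment) pairs that mention it, in input order."""
    grouped: dict[str, list[tuple[str, str]]] = defaultdict(list)
    for reviewer, location, comment in feedback:
        grouped[location].append((reviewer, comment))
    return dict(grouped)


grouped = group_by_location(raw)
for location, comments in grouped.items():
    reviewers = ", ".join(r for r, _ in comments)
    print(f"{location}: {len(comments)} comment(s) from {reviewers}")
```

Locations with comments from several reviewers are the ones worth merging into a single item; single-reviewer locations can usually pass through unchanged.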

#### Strategy 3: Conflict Resolution

**Conflicting feedback**:
```
ConservativeParent: "Remove discussion of evolution."
LiberalParent: "Expand evolution section with more details."
```

**Secretary's resolution**:
```
REQUIRES DISCUSSION: Evolution content (conflicting feedback)

Current state: Brief mention of evolution in context of biological adaptation

Conservative parent concern: Evolution is controversial in some communities
Liberal parent concern: Evolution is foundational to understanding biology

RECOMMENDATION: Keep current brief mention with these modifications:
1. Add note: "Evolution by natural selection is the scientific consensus 
   explanation for biological diversity" (factual, not advocacy)
2. Provide "opt-out" notice for districts where evolution is contested
3. Offer alternative activities focusing on observable adaptation (birds, bacteria)

This approach:
- Maintains scientific integrity (liberal parent concern)
- Respects community values (conservative parent concern)  
- Gives districts flexibility (admin/district concern)
```

#### Strategy 4: Preserving Positive Feedback

Many consolidations focus only on problems. The secretary also highlights what's working:

```
APPROVED ASPECTS:
✅ Clear learning objectives (DistrictRep)
✅ Age-appropriate language (GrammarReviewer)
✅ Good use of real-world examples (all reviewers)
✅ Diverse representation in examples (LiberalParent)

→ These elements should be preserved in revisions
```

**Why this matters**: Prevents the writer from "fixing" things that aren't broken.

---

### Design Patterns in Secretary Consolidation

The secretary implementation uses several design patterns:

#### Pattern 1: Aggregator Pattern

```
# Collects outputs from multiple agents (reviewers)
# Produces single unified output (consolidated review)

Input:  [Review1, Review2, Review3, Review4, Review5]
         ↓
Aggregator (Secretary)
         ↓
Output: ConsolidatedReview
```
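In code, the aggregator can be a small class that receives its LLM as a dependency, which keeps it testable with a fake. This is a hypothetical sketch, not the app's `PanelSecretary`: the prompt wording is a placeholder for the real secretary prompt templates, and `LLM` is any async function from prompt to text. It formats reviews with the same `BEGIN review by … END review` delimiters the panel uses.

```python
import asyncio
from typing import Awaitable, Callable, Sequence

LLM = Callable[[str], Awaitable[str]]  # any async prompt -> text function


class Secretary:
    """Hypothetical aggregator: many reviews in, one consolidated review out."""

    def __init__(self, llm: LLM) -> None:
        self.llm = llm  # injected, so tests can pass a fake

    @staticmethod
    def format_reviews(reviews: Sequence[tuple[str, str]]) -> str:
        return "\n".join(f"BEGIN review by {name}:\n{text}\nEND review\n" for name, text in reviews)

    async def consolidate(self, topic: str, reviews: Sequence[tuple[str, str]]) -> str:
        prompt = (
            f"You are the secretary of a review panel for an article on {topic}.\n"
            "Summarize all reviewer feedback into specific directions for the writer.\n\n"
            + self.format_reviews(reviews)
        )
        return await self.llm(prompt)


async def fake_llm(prompt: str) -> str:
    # Stand-in for a real model call: report how many reviews it received.
    return f"Consolidated {prompt.count('BEGIN review by')} reviews"


secretary = Secretary(fake_llm)
# In a notebook with a running event loop, use `await secretary.consolidate(...)` instead.
summary = asyncio.run(secretary.consolidate(
    "photosynthesis",
    [("GRAMMAR_REVIEWER", "Fix 3 spelling errors."), ("DISTRICT_REP", "Length is fine.")],
))
print(summary)  # Consolidated 2 reviews
```

Swapping `fake_llm` for a real client is the only change needed in production, which is the same dependency-injection idea the panel uses for its agents.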

#### Pattern 2: Decorator Pattern (Implicit)

The secretary "decorates" raw reviews with:
- Priority levels
- Conflict resolution
- Actionable recommendations

#### Pattern 3: Chain of Responsibility (Two-Round Review)

```
Round 1: Independent reviews → Secretary Consolidation → Interim Feedback
Round 2: Informed reviews (with Round 1 feedback) → Secretary Consolidation → Final Feedback
```

---

### Real-World Secretary Output

In production, the secretary's output becomes the input to the **Writer's revision** (Pattern 18: Reflection).

**Full workflow**:
```
1. Writer creates draft
2. ReviewerPanel evaluates (6 reviewers in parallel)
3. Secretary consolidates feedback
4. Writer revises based on consolidated feedback (not raw reviews)
5. Repeat until approved
```

**Key benefit**: Writer gets **one clear set of instructions** instead of 6 potentially conflicting perspectives.

---

### Key Takeaways: Secretary Pattern

- ✅ **Aggregation**: Combines multiple agent outputs into one
- ✅ **Prioritization**: Ranks feedback by importance
- ✅ **Conflict resolution**: Handles disagreements gracefully
- ✅ **Simplification**: Converts technical feedback to action items
- ✅ **Synthesis**: Combines related feedback
- ✅ **Preservation**: Highlights what's working (not just problems)

**When to use**:
- ✅ Multiple agents provide overlapping feedback
- ✅ Agents may disagree
- ✅ Downstream consumer (writer) needs clarity, not complexity
- ✅ Want to preserve positive feedback alongside critiques

---

```python
# Manual secretary consolidation (simulating what the LLM would do)
def mock_secretary_consolidate(reviews: list) -> str:
    """Return a hard-coded consolidation of `mock_reviews` (stands in for the LLM call; `reviews` is not parsed)."""

    consolidation = """
CONSOLIDATED REVIEW SUMMARY
================================

CRITICAL (Fix Before Publication):
1. Spelling Errors (GrammarReviewer)
   - Line 5: 'occured' → 'occurred'
   - Line 12: 'seperate' → 'separate'
   - Line 18: 'recieve' → 'receive'

APPROVED ASPECTS:
2. Length and Structure (DistrictRep)
   - ✅ Article length (~400 words) is appropriate
   - ✅ Well-structured and relatable examples
   - ✅ No changes needed

3. Content Appropriateness (ConservativeParent)
   - ✅ Factually accurate
   - ✅ No controversial content
   - ✅ Approved for use

OPTIONAL ENHANCEMENT (Consider for Future):
4. Contemporary Connection (LiberalParent)
   - Suggestion: Add brief section on climate change impact on photosynthesis
   - Rationale: Connects to student interest in environmental issues
   - Placement: Could add 1-2 sentences at end without increasing length significantly
   - Decision: Optional - discuss with curriculum team

OVERALL RECOMMENDATION:
Fix the 3 critical spelling errors, then publish. Climate change connection 
is a good idea but not required for initial publication.
"""
    return consolidation.strip()

# Apply consolidation
consolidated_feedback = mock_secretary_consolidate(mock_reviews)

print("\nSECRETARY CONSOLIDATED FEEDBACK")
print("=" * 70)
print(consolidated_feedback)

print("\n" + "=" * 70)
print("\n✅ Benefits of consolidation:")
print("  1. Clear priorities (CRITICAL vs APPROVED vs OPTIONAL)")
print("  2. Specific action items (exact line numbers, exact changes)")
print("  3. Positive feedback preserved (approved aspects listed)")
print("  4. Non-essential ideas triaged (climate change suggestion framed as optional)")
print("  5. Overall recommendation provided (fix 3 errors, then publish)")
```

```python
# Mock reviews from different reviewers
mock_reviews = [
    (Reviewer.GRAMMAR_REVIEWER, 
     "Fix 3 spelling errors: 'occured' (line 5) should be 'occurred', "
     "'seperate' (line 12) should be 'separate', 'recieve' (line 18) should be 'receive'."),

    (Reviewer.DISTRICT_REP,
     "Article is clear and well-structured. Length is appropriate at ~400 words. "
     "Good use of examples that students can relate to."),

    (Reviewer.CONSERVATIVE_PARENT,
     "The photosynthesis article is factually accurate and appropriate. "
     "No controversial content detected. Approve for use."),

    (Reviewer.LIBERAL_PARENT,
     "Good scientific content. Consider adding: How does climate change affect photosynthesis? "
     "This connects science to contemporary environmental issues students care about."),
]

print("MOCK REVIEWER FEEDBACK")
print("=" * 70)
for i, (reviewer, feedback) in enumerate(mock_reviews, 1):
    print(f"\n{i}. {reviewer.name}:")
    print(f"   {feedback}")

print("\n" + "=" * 70)
```

### How Secretary Consolidation Works

Let's trace through the consolidation process step by step.

#### Step 1: Collect All Reviews

The secretary receives reviews as a list of `(Reviewer, review_text)` tuples:

```python
reviews_so_far = [
    (Reviewer.GRAMMAR_REVIEWER, "Fix spelling errors on lines 5, 8, 12..."),
    (Reviewer.DISTRICT_REP, "Article is too long. Reduce to 300 words..."),
    (Reviewer.CONSERVATIVE_PARENT, "Remove paragraph 3..."),
    (Reviewer.LIBERAL_PARENT, "Expand paragraph 3..."),
    (Reviewer.SCHOOL_ADMIN, "Add safety warning..."),
]
```

#### Step 2: Format for LLM

Reviews are formatted with clear delimiters:

```python
reviews_text = []
for reviewer, review in reviews_so_far:
    reviews_text.append(f"BEGIN review by {reviewer.name}:\n{review}\nEND review\n")
```

**Formatted output**:
```
BEGIN review by GRAMMAR_REVIEWER:
Fix spelling errors on lines 5, 8, 12...
END review

BEGIN review by DISTRICT_REP:
Article is too long. Reduce to 300 words...
END review

BEGIN review by CONSERVATIVE_PARENT:
Remove paragraph 3...
END review

BEGIN review by LIBERAL_PARENT:
Expand paragraph 3...
END review

BEGIN review by SCHOOL_ADMIN:
Add safety warning...
END review
```

#### Step 3: Create Consolidation Prompt

```python
prompt_vars = {
    "prompt_name": "Secretary_consolidate_reviews",
    "topic": topic,
    "article": article,
    "reviews": reviews_text  # All formatted reviews
}
prompt = PromptService.render_prompt(**prompt_vars)
```

#### Step 4: LLM Consolidation

The LLM (acting as secretary) analyzes all reviews and produces consolidated feedback:

**Example consolidated output**:
```
CRITICAL (Fix Immediately):
1. Add safety warning and parental consent requirement for chemistry experiment (SchoolAdmin)
2. Fix 12 spelling errors throughout the article (GrammarReviewer)

HIGH PRIORITY (Address Before Final Review):
3. Reduce article length from 500 to 300 words (DistrictRep) - suggest cutting
   examples rather than core content
4. Math notation: Use × instead of * for multiplication (MathReviewer)

REQUIRES DISCUSSION (Conflicting Feedback):
5. Paragraph 3 feedback:
   - Conservative parent wants to remove discussion of colonialism's negative impacts
   - Liberal parent wants to expand this section
   
   RECOMMENDATION: Keep paragraph but revise for balance:
   - Acknowledge both positive contributions AND negative impacts
   - Use neutral, factual language
   - Keep length moderate (current length is acceptable)
   - Focus on historical context rather than moral judgments
   
   This approach addresses both concerns: Conservative parent gets more balanced 
   framing, Liberal parent gets to keep the critical context.

OPTIONAL (Consider for Future Revisions):
6. Consider adding more diverse examples in future revisions
```

---

### Simulating Secretary Consolidation

Let's create a mock consolidation example to see the pattern in action:

```python
# Read the secretary's system prompt
print("SECRETARY SYSTEM PROMPT")
print("=" * 70)
secretary_prompt = (prompts_dir / "secretary_system_prompt.j2").read_text()
print(secretary_prompt.strip())

print("\n" + "=" * 70)
print("\n🔍 Analysis:")
print("  - Role: Secretary of curriculum review panel")
print("  - Task: Summarize ALL reviewer feedback")
print("  - Output: Specific directions for the writer")
print("  - Goal: Enable writer to revise based on consolidated feedback")
print("\nNote: The prompt is deliberately simple, allowing the LLM to use")
print("its reasoning capabilities to handle complex consolidation tasks.")
```

## Part 5: Secretary Consolidation Pattern

After 6 reviewers provide their feedback, we face a critical challenge: **how do we turn diverse, sometimes conflicting feedback into actionable guidance?**

This is where the **Secretary pattern** comes in.

### The Problem: Information Overload

Without consolidation, the writer receives:

```
GrammarReviewer: "Fix 12 spelling errors on lines 5, 8, 12, 15, 18, 21, 24, 27, 30, 33, 36, 39."

DistrictRepReviewer: "Article is too long. Reduce from 500 words to 300 words to save printing costs."

ConservativeParent: "Remove paragraph 3 which discusses the negative impacts of colonialism. 
                     Focus on positive contributions of Western civilization."

LiberalParent: "Paragraph 3 is the most important! It provides critical context about power 
                dynamics. Actually expand it with more examples."

MathReviewer: "Equation on line 15 should use × instead of * for multiplication."

SchoolAdmin: "Chemistry experiment on page 2 requires safety warning and parental consent form."
```

**Writer's reaction**: 😵 "Where do I even start? Conservative and Liberal parents directly contradict each other!"

---

### The Solution: Secretary Consolidation

The **PanelSecretary** acts as a **meta-reviewer** that:

1. **Synthesizes** - Combines related feedback
2. **Prioritizes** - Ranks by importance (CRITICAL → HIGH → OPTIONAL)
3. **Resolves conflicts** - Provides balanced guidance when reviewers disagree
4. **Simplifies** - Converts technical feedback into clear action items

**Code Reference**: [`composable_app/agents/reviewer_panel.py:65-98`](../../agents/reviewer_panel.py#L65-L98)

### Secretary's System Prompt

Let's examine what instructions the secretary receives:

### Analysis: Contrasting Personas

Notice the stark differences between the two adversarial reviewers:

| Aspect | Conservative Parent | Liberal Parent |
|--------|-------------------|----------------|
| **History focus** | Patriotic narratives, Western civilization | Individual agency, critical thinking |
| **Controversial topics** | Downplay slavery, colonialism | Highlight power dynamics, social justice |
| **Math approach** | Traditional algorithms, foundational skills | Real-world contexts, problem-solving |
| **Values** | Civic virtue, accuracy, efficiency | Diverse perspectives, contemporary relevance |

**Why include both?**
- ✅ **Surface trade-offs**: Make implicit tensions explicit
- ✅ **Broader acceptance**: Address concerns from both ends of the spectrum
- ✅ **Better content**: Balancing diverse viewpoints creates richer material
- ✅ **Risk mitigation**: Catch issues that would alienate either group

**Example conflict**:

```
Article: "The American Revolution was a fight for independence from British colonial rule."

Conservative Parent: "✅ Good! Emphasizes patriotic narrative and American independence."

Liberal Parent: "⚠️ Missing context: What about enslaved people who weren't granted liberty? 
                Consider adding: 'While the Revolution secured independence for colonists, 
                it did not extend these rights to enslaved Africans.'"

Secretary (consolidation): "Consider adding a nuanced note about the limitations of 
                           Revolutionary-era liberty to provide historical context while 
                           maintaining the patriotic narrative focus."
```

This conflict is **valuable** - it leads to more historically accurate and broadly acceptable content.

---

### Designing Effective Reviewer Personas

Based on the ComposableApp's implementation, here are **4 principles** for designing reviewer personas:

#### Principle 1: Clear, Focused Mandate

**Good** (Grammar Reviewer):
```
"You are a stickler for formal language in all school content."
```
- ✅ Clear role (formal language expert)
- ✅ Focused scope (grammar, not content)
- ✅ Objective criteria

**Bad**:
```
"You are an expert in education who reviews content for quality."
```
- ❌ Vague role (what kind of quality?)
- ❌ Broad scope (overlaps with other reviewers)
- ❌ Subjective criteria

#### Principle 2: Specific Values and Priorities

**Good** (Conservative Parent):
```
"You want a focus on Western civilization and a more positive view of 
American/European history, and want to downplay aspects like the history 
of slavery, colonialism..."
```
- ✅ Explicit values (patriotic, traditional)
- ✅ Specific priorities (Western civ, positive framing)
- ✅ Clear stance on controversial topics

**Bad**:
```
"You care about traditional values."
```
- ❌ Vague values (what traditions?)
- ❌ No specific priorities
- ❌ Unclear how to apply

#### Principle 3: Representative of Real Stakeholders

**Good** (Liberal Parent):
```
"You value the connection between history and contemporary issues, 
highlighting themes of power, liberty, and individual rights."
```
- ✅ Reflects actual liberal parent concerns
- ✅ Based on real-world feedback patterns
- ✅ Predictive of actual controversies

**Bad**:
```
"You want content to be progressive and modern."
```
- ❌ Stereotype, not representative
- ❌ Not based on real feedback
- ❌ Won't catch actual issues

#### Principle 4: Productive Tension

**Good** (Conservative vs. Liberal):
```
Conservative: "Focus on traditional algorithms"
Liberal: "Emphasize real-world contexts"
```
- ✅ Genuine disagreement
- ✅ Both valid perspectives
- ✅ Forces better solutions (balance both)

**Bad**:
```
Reviewer A: "Make content short"
Reviewer B: "Make content long"
```
- ❌ Artificial conflict
- ❌ No underlying values
- ❌ Doesn't lead to better content

---

### Exercise: Design Your Own Reviewer Persona

Try designing a new reviewer for a different domain. Example: **STEM Equity Advocate**

```python
persona = """
You are a STEM equity advocate who wants to ensure that science and math content is 
accessible to all students regardless of background.

Priorities:
- Examples should reflect diverse cultures and communities
- Experiments should not require expensive materials unavailable to low-income schools
- Language should avoid jargon and be accessible to English language learners
- Real-world applications should connect to varied student experiences, not just suburban contexts

Red flags:
- Examples that assume access to technology, travel, or expensive hobbies
- Experiments requiring specialized equipment not available in all schools
- Cultural references that alienate non-Western or non-middle-class students
- Language that favors native English speakers
"""
```

**Questions to ask**:
1. ✅ Does this represent a real stakeholder?
2. ✅ Are the priorities specific and actionable?
3. ✅ Will this catch issues others miss?
4. ✅ Does this create productive tension with other reviewers?
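To try a persona like this with the ComposableApp, the prompt has to live where `ReviewerAgent.__init__` looks for it: the filename is derived from the enum member name via `f"{reviewer.name}_system_prompt".lower()`. The sketch below assumes the prompts are stored as `.j2` files in one directory; `DemoReviewer`, `prompt_path_for()`, and `save_persona()` are hypothetical names for illustration.

```python
from enum import Enum
from pathlib import Path


class DemoReviewer(Enum):
    # Hypothetical enum for illustration only; the real `Reviewer` enum lives in
    # composable_app/agents/reviewer_panel.py and has no STEM_EQUITY_ADVOCATE member.
    GRAMMAR_REVIEWER = 1
    STEM_EQUITY_ADVOCATE = 2


def prompt_path_for(prompts_dir: Path, reviewer: Enum) -> Path:
    # Same naming rule as ReviewerAgent.__init__, stored as a .j2 template file
    filename = f"{reviewer.name}_system_prompt".lower() + ".j2"
    return prompts_dir / filename


def save_persona(prompts_dir: Path, reviewer: Enum, persona_text: str) -> Path:
    """Write a persona where a ReviewerAgent for `reviewer` would look for it."""
    prompts_dir.mkdir(parents=True, exist_ok=True)
    path = prompt_path_for(prompts_dir, reviewer)
    path.write_text(persona_text.strip() + "\n")
    return path


# Usage (writes to a scratch directory rather than the app's real prompts/):
# save_persona(Path("scratch_prompts"), DemoReviewer.STEM_EQUITY_ADVOCATE, persona)
```

Wiring the persona into the real panel would also require a new `Reviewer` member. Because the panel is built from `list(Reviewer)[:-1]`, which skips the enum's last member, a new reviewer should not be appended as the final member.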

---

In [None]:
# Adversarial reviewer prompts
print("=" * 70)
print("ADVERSARIAL REVIEWERS (Subjective, Value-Based)")
print("=" * 70)

print("\n1. CONSERVATIVE PARENT REVIEWER")
print("-" * 70)
conservative_prompt = (prompts_dir / "conservative_parent_system_prompt.j2").read_text()
print(conservative_prompt.strip())

print("\n\n2. LIBERAL PARENT REVIEWER")
print("-" * 70)
liberal_prompt = (prompts_dir / "liberal_parent_system_prompt.j2").read_text()
print(liberal_prompt.strip())

print("\n\n3. SCHOOL ADMIN REVIEWER")
print("-" * 70)
if (prompts_dir / "school_admin_system_prompt.j2").exists():
    admin_prompt = (prompts_dir / "school_admin_system_prompt.j2").read_text()
    print(admin_prompt.strip())
else:
    print("(Note: school_admin_system_prompt.j2 not found - using SCHOOL_ADMIN as secretary)")

In [None]:
# Read system prompts for different reviewer types
prompts_dir = composable_app_path / "prompts"

# Specialist reviewer prompts
print("=" * 70)
print("SPECIALIST REVIEWERS (Objective, Technical)")
print("=" * 70)

print("\n1. GRAMMAR REVIEWER")
print("-" * 70)
grammar_prompt = (prompts_dir / "grammar_reviewer_system_prompt.j2").read_text()
print(grammar_prompt.strip())

print("\n\n2. DISTRICT REP REVIEWER")
print("-" * 70)
district_prompt = (prompts_dir / "district_rep_system_prompt.j2").read_text()
print(district_prompt.strip())

## Part 4: Specialist vs. Adversarial Reviewer Design

The power of multi-agent systems comes from **diversity of perspectives**. The ReviewerPanel uses two types of agents with very different roles.

### The Two Types of Reviewers

#### Type 1: Specialist Reviewers (Technical Experts)

**Goal**: Ensure accuracy, quality, and compliance

**Personas**:
1. **GrammarReviewer** - Formal language expert
2. **DistrictRepReviewer** - Budget and clarity focus  
3. *(MathReviewer is not in the current enum, but would be a specialist)*

**Characteristics**:
- ✅ Objective, measurable criteria
- ✅ Domain expertise (grammar, curriculum standards)
- ✅ Focus on correctness and quality
- ✅ Non-controversial feedback

#### Type 2: Adversarial Reviewers (Stakeholder Perspectives)

**Goal**: Identify potential controversies and concerns

**Personas**:
1. **ConservativeParentReviewer** - Traditional values, patriotic narratives
2. **LiberalParentReviewer** - Critical thinking, diverse perspectives
3. **SchoolAdminReviewer** - Legal/liability, budget concerns

**Characteristics**:
- ✅ Subjective, value-based criteria
- ✅ Stakeholder representation (parents, administrators)
- ✅ Focus on acceptance and controversy avoidance
- ✅ Often conflicting feedback

---

### Examining System Prompts

Let's look at the actual system prompts that define each reviewer's persona:

### Key Takeaways: Parallel Execution

- ✅ **`asyncio.gather()`** runs multiple async functions concurrently
- ✅ **Performance**: 5 reviewers in parallel ≈ 5x faster than sequential (when each call takes about the same time)
- ✅ **`return_exceptions=True`**: Continue despite individual failures
- ✅ **Filter results**: Separate successful reviews from exceptions
- ✅ **Use case**: Ideal for I/O-bound operations (LLM API calls, database queries)

**When to use parallel vs. sequential**:
- ✅ **Parallel**: Independent tasks (reviewers don't need each other's output in Round 1)
- ❌ **Sequential**: Dependent tasks (Round 2 reviewers need to see Round 1 feedback)
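The dependency is between rounds, not within a round: Round 2 needs all of Round 1, but Round 2 reviewers do not need each other. A minimal sketch of that shape, with one `gather()` per round and the rounds awaited in order. `fake_review()` is a hypothetical stand-in for `ReviewerAgent.review()` that sleeps instead of calling an LLM.

```python
import asyncio


async def fake_review(name: str, prior: list[tuple[str, str]]) -> str:
    # Hypothetical stand-in for ReviewerAgent.review(); no API call is made
    await asyncio.sleep(0.1)
    return f"{name} review (saw {len(prior)} prior reviews)"


async def two_round_review(names: list[str]) -> list[tuple[str, str]]:
    # Round 1: reviewers are independent, so run them concurrently
    round1 = await asyncio.gather(*(fake_review(n, []) for n in names))
    first_round = list(zip(names, round1))

    # Round 2 depends on ALL of Round 1, so it starts only after the first gather
    # finishes; within Round 2 the reviewers are again independent of each other
    round2 = await asyncio.gather(*(fake_review(n, first_round) for n in names))
    return list(zip(names, round2))


# In the notebook: final = await two_round_review(["Grammar", "District"])
# In a script:     final = asyncio.run(two_round_review(["Grammar", "District"]))
```

With n reviewers, this takes roughly two reviewer latencies instead of 2n.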

---

### Real-World Performance

**Illustrative estimate** (assuming each real LLM API call takes about 5 seconds):
- **Sequential**: 6 reviewers × 5s each = **30 seconds**
- **Parallel**: 6 reviewers ≈ **5-6 seconds** (roughly a 5-6x speedup)
- **Cost**: Same (6 API calls either way)
- **User experience**: Much better (about 5s vs. 30s wait time)

The ComposableApp **currently runs Round 1 and Round 2 sequentially** (see the `for` loops in `do_first_round_reviews()` and `do_second_round_reviews()`). Sequential execution has practical advantages:
1. Simpler debugging (linear execution)
2. Fewer rate-limiting issues
3. More predictable token usage

**Exercise for you**: Modify `do_first_round_reviews()` to use `asyncio.gather()` for parallel execution!
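If rate limits are the concern, a middle ground between fully sequential and fully parallel is bounded concurrency. A minimal sketch using `asyncio.Semaphore`; `gather_bounded()` is a hypothetical helper, not part of the ComposableApp or of `asyncio`.

```python
import asyncio


async def gather_bounded(coros, limit: int, return_exceptions: bool = True) -> list:
    """Run coroutines concurrently, but never more than `limit` at once.

    Results come back in the same order as `coros`, just like asyncio.gather().
    """
    semaphore = asyncio.Semaphore(limit)

    async def run_one(coro):
        # Waits here while `limit` calls are already in flight
        async with semaphore:
            return await coro

    return await asyncio.gather(
        *(run_one(c) for c in coros), return_exceptions=return_exceptions
    )


# Hypothetical use inside a parallel do_first_round_reviews() (limit=3 is arbitrary):
# reviews = await gather_bounded(
#     [agent.review(topic, article, reviews_so_far=[]) for agent in review_panel],
#     limit=3,
# )
```

With `limit=1` this degenerates to sequential execution; with `limit` equal to the number of reviewers it behaves like a plain `gather()`.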

---

In [None]:
# Demo: WITH return_exceptions=True (robust)
print("WITH return_exceptions=True (robust)")
print("=" * 60)

results = await asyncio.gather(
    mock_review_with_failure("Grammar", should_fail=False),
    mock_review_with_failure("Math", should_fail=True),  # This one fails
    mock_review_with_failure("District", should_fail=False),
    return_exceptions=True  # ✅ Return exceptions as values
)

print(f"\n📊 Results returned: {len(results)} items")
for i, result in enumerate(results):
    if isinstance(result, BaseException):
        print(f"  {i+1}. ❌ Exception: {result}")
    else:
        print(f"  {i+1}. ✅ Success: {result}")

# Filter out exceptions (BaseException also covers asyncio.CancelledError,
# which is not an Exception subclass in Python 3.8+)
successful_reviews = [r for r in results if not isinstance(r, BaseException)]
failed_reviews = [r for r in results if isinstance(r, BaseException)]

print(f"\n✅ Successful reviews: {len(successful_reviews)}")
print(f"❌ Failed reviews: {len(failed_reviews)}")
print(f"\n🎯 Key benefit: got {len(successful_reviews)} of {len(results)} reviews despite {len(failed_reviews)} failure(s)!")

In [None]:
# Mock review function that sometimes fails
async def mock_review_with_failure(reviewer_name: str, should_fail: bool = False) -> str:
    """Simulate a reviewer that might fail."""
    print(f"  ⏳ {reviewer_name} started reviewing...")
    await asyncio.sleep(1.0)
    
    if should_fail:
        print(f"  ❌ {reviewer_name} FAILED (simulated error)")
        raise ValueError(f"{reviewer_name} encountered an error during review")
    
    print(f"  ✅ {reviewer_name} finished")
    return f"{reviewer_name} review: All good!"

# Demo: WITHOUT return_exceptions (default behavior)
print("WITHOUT return_exceptions=True (default)")
print("=" * 60)
try:
    results = await asyncio.gather(
        mock_review_with_failure("Grammar", should_fail=False),
        mock_review_with_failure("Math", should_fail=True),  # This one fails
        mock_review_with_failure("District", should_fail=False),
    )
    print(f"✅ All reviews completed: {results}")
except Exception as e:
    print(f"❌ Exception raised: {e}")
    print("   Problem: the caller receives no results, even though 2 of 3 reviews succeeded!")

print("\n" + "="*60 + "\n")

### Handling Failures with return_exceptions=True

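What happens if one reviewer fails? By default (`return_exceptions=False`), `asyncio.gather()` propagates the first exception to the caller immediately. The other reviews are **not** cancelled and keep running, but their results never reach the caller. For reviews this is a problem: we want as much feedback as possible even if one reviewer fails.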

**Solution**: Use `return_exceptions=True` to continue despite failures.
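A hung reviewer is another failure mode: `return_exceptions=True` alone still waits for the slowest call. Wrapping each call in `asyncio.wait_for()` turns an overrun into an exception (`wait_for` cancels the wrapped call and raises `asyncio.TimeoutError`), which `gather()` then collects like any other failure. A minimal sketch; `slow_review()` and `reviews_with_deadline()` are hypothetical, and the 0.5-second deadline is arbitrary.

```python
import asyncio


async def slow_review(name: str, delay: float) -> str:
    # Hypothetical stand-in for a reviewer call; sleeps instead of calling an LLM
    await asyncio.sleep(delay)
    return f"{name} review"


async def reviews_with_deadline(deadline: float = 0.5) -> list:
    return await asyncio.gather(
        asyncio.wait_for(slow_review("Grammar", 0.05), timeout=deadline),
        asyncio.wait_for(slow_review("District", 2.0), timeout=deadline),  # too slow
        return_exceptions=True,
    )


# In the notebook: results = await reviews_with_deadline()
# -> the first item is "Grammar review", the second is a TimeoutError instance
```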

In [None]:
# Parallel execution - all reviewers at once
print("PARALLEL EXECUTION (asyncio.gather)")
print("=" * 60)
start_time = time.time()

# Create tasks for all reviewers
tasks = [mock_review(reviewer, delay_seconds=2.0) for reviewer in reviewers]

# Execute all tasks in parallel
reviews_parallel = await asyncio.gather(*tasks)

total_time_parallel = time.time() - start_time

print(f"\n📊 Results:")
print(f"  Total time: {total_time_parallel:.2f} seconds")
print(f"  Reviews completed: {len(reviews_parallel)}")
print(f"  Effective time per review: {total_time_parallel / len(reviewers):.2f}s")

# Compare performance
print(f"\n🚀 PERFORMANCE COMPARISON:")
print(f"  Sequential: {total_time_sequential:.2f}s")
print(f"  Parallel:   {total_time_parallel:.2f}s")
print(f"  Speedup:    {total_time_sequential / total_time_parallel:.1f}x faster!")
print(f"  Time saved: {total_time_sequential - total_time_parallel:.2f}s ({(1 - total_time_parallel/total_time_sequential)*100:.0f}% reduction)")

### Parallel Execution with asyncio.gather()

In [None]:
# Sequential execution - one reviewer at a time
reviewers = ["Grammar", "Math", "District", "Conservative", "Liberal"]

print("SEQUENTIAL EXECUTION")
print("=" * 60)
start_time = time.time()

reviews_sequential = []
for reviewer in reviewers:
    review = await mock_review(reviewer, delay_seconds=2.0)
    reviews_sequential.append(review)

total_time_sequential = time.time() - start_time

print(f"\n📊 Results:")
print(f"  Total time: {total_time_sequential:.2f} seconds")
print(f"  Reviews completed: {len(reviews_sequential)}")
print(f"  Average time per review: {total_time_sequential / len(reviewers):.2f}s")

### Sequential Execution (Baseline)

In [None]:
# Mock review function (simulates API call without actually calling it)
async def mock_review(reviewer_name: str, delay_seconds: float = 2.0) -> str:
    """Simulate a reviewer taking time to review (without API call)."""
    print(f"  ⏳ {reviewer_name} started reviewing...")
    await asyncio.sleep(delay_seconds)  # Simulate API latency
    review_text = f"{reviewer_name} completed review: Article looks good overall."
    print(f"  ✅ {reviewer_name} finished (took {delay_seconds}s)")
    return review_text

# Test the mock function
print("Testing mock_review function:")
print("=" * 60)
review = await mock_review("GrammarReviewer", delay_seconds=1.0)
print(f"\nReview output: {review}")

---

## Part 3: Parallel Execution with asyncio.gather()

Now let's see the key performance optimization: **parallel execution** of independent reviewers.

### The Problem: Sequential Execution is Slow

**Current implementation** (`do_first_round_reviews()` in `reviewer_panel.py`) runs reviewers **sequentially**:

```python
async def do_first_round_reviews(article, topic) -> list:
    review_panel = [ReviewerAgent(reviewer) for reviewer in list(Reviewer)[:-1]]
    first_round_reviews = list()
    
    for reviewer_agent in review_panel:  # ❌ Sequential loop
        review = await reviewer_agent.review(topic, article, reviews_so_far=[])
        first_round_reviews.append((reviewer_agent.reviewer, review))
    
    return first_round_reviews
```

**Time breakdown** (if each reviewer takes 5 seconds):
```
GrammarReviewer:        [====] 5s
MathReviewer:           [====] 5s
DistrictRepReviewer:    [====] 5s
ConservativeParent:     [====] 5s
LiberalParent:          [====] 5s

Total: 5 × 5s = 25 seconds ❌
```

### The Solution: Parallel Execution

**Optimized version** using `asyncio.gather()`:

```python
import asyncio


async def do_first_round_reviews_parallel(article, topic) -> list:
    review_panel = [ReviewerAgent(reviewer) for reviewer in list(Reviewer)[:-1]]
    
    # Create all review coroutines (nothing runs until they are awaited)
    review_tasks = [
        reviewer_agent.review(topic, article, reviews_so_far=[])
        for reviewer_agent in review_panel
    ]
    
    # Execute all tasks concurrently ✅
    reviews = await asyncio.gather(*review_tasks)
    
    # gather() returns results in argument order, so zip() pairs each reviewer
    # with its own review (same tuple shape as the sequential Round 1)
    first_round_reviews = [
        (reviewer_agent.reviewer, review)
        for reviewer_agent, review in zip(review_panel, reviews)
    ]
    
    return first_round_reviews
```

**Time breakdown** (parallel):
```
GrammarReviewer:        [====]
MathReviewer:           [====]
DistrictRepReviewer:    [====]  All run simultaneously
ConservativeParent:     [====]
LiberalParent:          [====]

Total: max(5s, 5s, 5s, 5s, 5s) = 5 seconds ✅ 5x faster!
```
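The two timelines generalize as follows, assuming $n$ independent reviewers with call latencies $t_1, \dots, t_n$ and ignoring event-loop overhead and API rate limits:

$$
T_{\text{seq}} = \sum_{i=1}^{n} t_i, \qquad
T_{\text{par}} \approx \max_{1 \le i \le n} t_i, \qquad
S = \frac{T_{\text{seq}}}{T_{\text{par}}} \le n
$$

With $n = 5$ and every $t_i = 5$ s: $T_{\text{seq}} = 25$ s, $T_{\text{par}} \approx 5$ s, and $S = 5$. The bound $S = n$ is reached only when all calls take the same time; a single slow reviewer sets $T_{\text{par}}$ on its own.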

### Understanding asyncio.gather()

**`asyncio.gather(*tasks)`** runs multiple async functions concurrently:

```python
# Sequential (one after another)
result1 = await func1()  # Wait for func1
result2 = await func2()  # Wait for func2
result3 = await func3()  # Wait for func3
# Total: time(func1) + time(func2) + time(func3)

# Parallel (all at once)
results = await asyncio.gather(
    func1(),  # Start all three immediately
    func2(),
    func3(),
)
# Total: max(time(func1), time(func2), time(func3))
```

**Key parameters**:
- **`*tasks`** - Unpacks list of coroutines
- **`return_exceptions=True`** - Continue if one task fails (we'll demo this)
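One more property matters for the reviewer code: `gather()` returns results in the order the awaitables were passed, not the order they finished. That is what makes `zip(review_panel, reviews)` safe in the parallel version. A small self-contained demonstration (`timed()` and `order_demo()` are illustrative names):

```python
import asyncio


async def timed(name: str, delay: float, finished: list) -> str:
    await asyncio.sleep(delay)
    finished.append(name)  # records the actual completion order
    return name


async def order_demo():
    finished = []
    results = await asyncio.gather(
        timed("slow", 0.3, finished),
        timed("fast", 0.1, finished),
        timed("medium", 0.2, finished),
    )
    return results, finished


# In the notebook: results, finished = await order_demo()
# results  == ["slow", "fast", "medium"]   (argument order)
# finished == ["fast", "medium", "slow"]   (completion order)
```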

---

### Demo: Simulating Parallel Review

Let's create a **mock review function** that simulates the time each reviewer takes, without actually calling the API:



### Component 3: Review Orchestration Workflow

The `get_panel_review_of_article()` function orchestrates the entire multi-agent review process: **two review rounds** followed by a **consolidation step** performed by the secretary:

**Code Reference**: [`composable_app/agents/reviewer_panel.py:100-131`](../../agents/reviewer_panel.py#L100-L131)

```python
async def get_panel_review_of_article(topic: str, article: Article) -> str:
    # Round 1: Independent reviews
    first_round_reviews = await do_first_round_reviews(article, topic)
    
    # Round 2: Reviews after seeing others' feedback
    final_reviews = await do_second_round_reviews(article, first_round_reviews, topic)
    
    # Round 3: Secretary consolidates
    return await summarize_reviews(article, final_reviews, topic)
```

#### Round 1: Independent Reviews

```python
async def do_first_round_reviews(article, topic) -> list:
    # Each reviewer evaluates independently (no knowledge of others)
    review_panel = [ReviewerAgent(reviewer) for reviewer in list(Reviewer)[:-1]]
    
    first_round_reviews = list()
    for reviewer_agent in review_panel:
        review = await reviewer_agent.review(topic, article, reviews_so_far=[])
        first_round_reviews.append((reviewer_agent.reviewer, review))
    
    return first_round_reviews
```

**Why independent first?**
- ✅ Prevents groupthink (reviewers not biased by others)
- ✅ Captures diverse perspectives
- ✅ Each reviewer focuses on their specialty

#### Round 2: Informed Reviews

```python
async def do_second_round_reviews(article, first_round_reviews, topic) -> list:
    # Each reviewer can now see what others said
    review_panel = [ReviewerAgent(reviewer) for reviewer in list(Reviewer)[:-1]]
    
    final_reviews = list()
    for reviewer_agent in review_panel:
        # Pass first_round_reviews so they can respond to others
        review = await reviewer_agent.review(topic, article, first_round_reviews)
        final_reviews.append((reviewer_agent.reviewer_type(), review))
    
    return final_reviews
```

**Why second round?**
- ✅ Reviewers can address each other's points
- ✅ Conflicts surface (e.g., conservative vs. liberal)
- ✅ More nuanced feedback ("I agree with Grammar but...")

In [None]:
# Create a PanelSecretary
from composable_app.agents.reviewer_panel import PanelSecretary

secretary = PanelSecretary()

print(f"✅ Created Secretary: {secretary.name()}")
print(f"\n📝 Secretary's role:")
print(f"  - Receives the final reviews from every panel reviewer")
print(f"  - Consolidates into single prioritized summary")
print(f"  - Resolves conflicts between reviewers")
print(f"  - Provides actionable guidance to writer")
print(f"\n🔧 Configuration:")
print(f"  - Model: {llms.DEFAULT_MODEL}")
print(f"  - System prompt: prompts/secretary_system_prompt.j2")
print(f"  - Output type: str (consolidated review text)")

### Component 2: PanelSecretary

The **PanelSecretary** consolidates feedback from all reviewers into a single, prioritized summary.

**Code Reference**: [`composable_app/agents/reviewer_panel.py:65-98`](../../agents/reviewer_panel.py#L65-L98)

```python
class PanelSecretary:
    def __init__(self):
        self.id = f"PanelSecretary {uuid.uuid4()}"
        system_prompt = PromptService.render_prompt("secretary_system_prompt")
        
        self.agent = Agent(
            llms.DEFAULT_MODEL,
            output_type=str,
            model_settings=llms.default_model_settings(),
            retries=2,
            system_prompt=system_prompt
        )
    
    async def consolidate(
        self, 
        topic: str, 
        article: Article, 
        reviews_so_far: List[Tuple[Reviewer, str]]
    ) -> str:
        # Format all reviews
        reviews_text = []
        for reviewer, review in reviews_so_far:
            reviews_text.append(f"BEGIN review by {reviewer.name}:\n{review}\nEND review\n")
        
        # Create consolidation prompt
        prompt_vars = {
            "prompt_name": "Secretary_consolidate_reviews",
            "topic": topic,
            "article": article,
            "reviews": reviews_text
        }
        
        prompt = PromptService.render_prompt(**prompt_vars)
        result = await self.agent.run(prompt)
        
        # Log consolidated review
        await evals.record_ai_response(
            "consolidated_review",
            ai_input=prompt_vars,
            ai_response=result.output
        )
        
        return result.output
```

**Secretary's responsibilities**:
1. **Synthesize** - Combine related feedback (e.g., grammar + style issues)
2. **Prioritize** - Rank by importance (CRITICAL → HIGH → OPTIONAL)
3. **Resolve conflicts** - When reviewers disagree, provide balanced guidance
4. **Simplify** - Convert technical feedback into clear action items

The secretary receives the **final review from every panel reviewer** and produces a **single consolidated review** for the writer.
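`ReviewerAgent.review()` and `PanelSecretary.consolidate()` both wrap each `(reviewer, review)` pair in the same `BEGIN review by ... / END review` block before rendering their prompts. A possible refactor is to factor that loop into one helper; `format_reviews()` below is a hypothetical name, not a function in the ComposableApp.

```python
from enum import Enum


def format_reviews(reviews_so_far: list[tuple[Enum, str]]) -> list[str]:
    """Wrap each review in BEGIN/END markers, matching the loop in
    ReviewerAgent.review() and PanelSecretary.consolidate()."""
    return [
        f"BEGIN review by {reviewer.name}:\n{review}\nEND review\n"
        for reviewer, review in reviews_so_far
    ]
```

Explicit delimiters make it easier for the model to attribute each piece of feedback to the reviewer who wrote it, which the secretary needs in order to resolve conflicts between them.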

In [None]:
# Create individual reviewers
from composable_app.agents.reviewer_panel import ReviewerAgent

# Specialist reviewers
grammar_reviewer = ReviewerAgent(Reviewer.GRAMMAR_REVIEWER)
district_rep = ReviewerAgent(Reviewer.DISTRICT_REP)

# Adversarial reviewers
conservative_parent = ReviewerAgent(Reviewer.CONSERVATIVE_PARENT)
liberal_parent = ReviewerAgent(Reviewer.LIBERAL_PARENT)

print("✅ Created 4 reviewers:")
print(f"  1. {grammar_reviewer.name()}")
print(f"  2. {district_rep.name()}")
print(f"  3. {conservative_parent.name()}")
print(f"  4. {liberal_parent.name()}")

print(f"\n📋 Each reviewer has:")
print(f"  - Unique ID (with UUID)")
print(f"  - Reviewer type enum")
print(f"  - Pydantic AI agent (configured with {llms.DEFAULT_MODEL})")
print(f"  - System prompt loaded from prompts/<reviewer_name>_system_prompt.j2")
print(f"    (e.g. prompts/{Reviewer.GRAMMAR_REVIEWER.name.lower()}_system_prompt.j2)")

### Creating Individual Reviewers

Let's create a few reviewers and examine their personas:

**Note**: We won't actually call the review methods yet (to avoid API costs). We'll just inspect the reviewer configuration.

In [None]:
# Create a sample article to review
sample_article = Article(
    title="Photosynthesis: How Plants Make Food",
    summary="An explanation of the photosynthesis process for 9th grade students.",
    full_text="""
Photosynthesis is the process by which plants make their own food using sunlight, water, and carbon dioxide.
The equation for photosynthesis is: 6CO₂ + 6H₂O + light energy → C₆H₁₂O₆ + 6O₂

Plants contain chlorophyll in their leaves, which captures light energy from the sun. This energy is used to 
convert carbon dioxide from the air and water from the soil into glucose (a type of sugar) and oxygen.

The glucose provides energy for the plant to grow, while the oxygen is released into the atmosphere as a 
byproduct. This is why forests are often called the "lungs of the Earth" - they produce oxygen that we breathe!
    """.strip(),
    keywords=["photosynthesis", "chlorophyll", "glucose", "oxygen", "carbon dioxide"]
)

print("Sample Article Created:")
print("=" * 60)
print(f"Title: {sample_article.title}")
print(f"Summary: {sample_article.summary}")
print(f"\nFull text ({len(sample_article.full_text)} characters):")
print(sample_article.full_text[:200] + "...")
print(f"\nKeywords: {', '.join(sample_article.keywords)}")

### Component 1: ReviewerAgent Class

The `ReviewerAgent` class represents a single reviewer. Let's look at its key features:

**Code Reference**: [`composable_app/agents/reviewer_panel.py:25-63`](../../agents/reviewer_panel.py#L25-L63)

```python
class ReviewerAgent:
    def __init__(self, reviewer: Reviewer):
        self.reviewer = reviewer
        self.id = f"{reviewer} Agent {uuid.uuid4()}"
        
        # Load reviewer-specific system prompt
        system_prompt_file = f"{reviewer.name}_system_prompt".lower()
        system_prompt = PromptService.render_prompt(system_prompt_file)
        
        # Create Pydantic AI agent
        self.agent = Agent(
            llms.DEFAULT_MODEL,
            output_type=str,
            model_settings=llms.default_model_settings(),
            retries=2,
            system_prompt=system_prompt
        )
    
    async def review(self, topic: str, article: Article, reviews_so_far: List[Tuple[Reviewer, str]]) -> str:
        # Build prompt with article and previous reviews
        reviews_text = []
        for reviewer, review in reviews_so_far:
            reviews_text.append(f"BEGIN review by {reviewer.name}:\n{review}\nEND review\n")
        
        prompt_vars = {
            "prompt_name": "ReviewerAgent_review_prompt",
            "topic": topic,
            "article": article,
            "reviews": reviews_text
        }
        
        # Generate review
        prompt = PromptService.render_prompt(**prompt_vars)
        result = await self.agent.run(prompt)
        
        # Log for evaluation
        await evals.record_ai_response(
            f"{self.reviewer.name}_review",
            ai_input=prompt_vars,
            ai_response=result.output
        )
        
        return result.output
```

**Key design decisions**:

1. **Enum-based reviewer types** - Uses `Reviewer` enum for type safety
2. **Dynamic system prompts** - Each reviewer loads its own persona from `prompts/{reviewer}_system_prompt.j2`
3. **Reviews awareness** - Can see previous reviews (`reviews_so_far`) for multi-round review
4. **Evaluation logging** - Records all reviews to `logs/evals.log` for analysis
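Because each persona is loaded by a filename derived from the enum member, a missing or misnamed template only surfaces when that agent is constructed. A minimal startup check, assuming the templates sit directly in the `prompts/` directory as `.j2` files; `missing_prompt_files()` is a hypothetical helper.

```python
from enum import Enum
from pathlib import Path


def missing_prompt_files(reviewer_enum: type[Enum], prompts_dir: Path) -> list[str]:
    """List the expected persona files that are missing from prompts_dir."""
    missing = []
    for reviewer in reviewer_enum:
        # Naming rule from ReviewerAgent.__init__, plus the .j2 extension
        filename = f"{reviewer.name}_system_prompt".lower() + ".j2"
        if not (prompts_dir / filename).exists():
            missing.append(filename)
    return missing


# In the notebook: missing_prompt_files(Reviewer, composable_app_path / "prompts")
```

A reported file is not necessarily an error: a member the panel never instantiates as a reviewer (such as the last member, which `list(Reviewer)[:-1]` skips) may legitimately have no persona file.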

Let's create a sample reviewer to see how it works:

In [None]:
# Let's examine the Reviewer enum (defines the reviewer types)
from composable_app.agents.reviewer_panel import Reviewer

print("Available Reviewer Types:")
print("=" * 50)
for reviewer in Reviewer:
    print(f"  {reviewer.value}. {reviewer.name}")
    
print(f"\nTotal reviewers: {len(list(Reviewer))}")
print("\nNote: SCHOOL_ADMIN is used as the secretary, not as a reviewer")

## Part 2: ReviewerPanel Architecture

Now let's examine the actual code that implements the multi-agent pattern. We'll look at the ReviewerPanel architecture step by step.

### Overview: ReviewerPanel Components

The ReviewerPanel consists of:
1. **ReviewerAgent** - Base class for individual reviewers
2. **PanelSecretary** - Consolidates feedback from all reviewers
3. **get_panel_review_of_article()** - Orchestrates the review workflow

Let's explore each component.