# Advanced Patterns: Provider Switching & Optimization (Composable App Tutorial)

## Learning Objectives
By completing this tutorial, you will:
- Understand Pydantic AI's provider abstraction layer
- Learn to switch between LLM providers (Gemini, OpenAI, Anthropic)
- Compare cost and latency trade-offs across providers
- Master parallel execution patterns with `asyncio.gather()`
- Optimize for cost and performance using model selection and temperature tuning

## Prerequisites
- **Python**: Advanced proficiency with async/await, context managers
- **LLM basics**: Understanding of temperature, tokens, API pricing
- **Setup**: API keys for Gemini (required), OpenAI/Anthropic (optional for comparisons)

## Estimated Time
40-45 minutes (reading + execution)

## Cost Estimate
⚠️ **Variable costs**:
- **Gemini only**: $0.05-0.10 (demonstration examples)
- **Multi-provider comparison**: $0.20-0.50 (if testing OpenAI/Anthropic)
- **Tip**: Skip multi-provider sections to minimize costs

> **Book Reference**: This pattern is detailed in *Generative AI Design Patterns*
> (Lakshmanan & Hapke, 2025), Chapter 8: "Model Cascades" and Chapter 29: "Cost Optimization".

---

## Why Provider Switching?

**Task 5.1.2**: LLM provider switching with Pydantic AI abstraction

### The Provider Lock-In Problem

**Traditional approach** (tightly coupled to provider):
```python
# Bad: Direct OpenAI integration
import openai
response = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[{"role": "user", "content": prompt}]
)
```

**Problems**:
- ❌ **Vendor lock-in**: Changing providers requires rewriting code
- ❌ **No fallback**: If OpenAI is down, the entire system fails
- ❌ **Hard to compare**: Can't A/B test different providers
- ❌ **Cost rigidity**: Can't dynamically choose a cheaper model

### Pydantic AI Abstraction

**Better approach** (provider-agnostic):
```python
# Good: Pydantic AI abstraction
from pydantic_ai import Agent

# Switch provider by changing model string
agent = Agent('gemini-2.0-flash')           # Google
agent = Agent('gpt-4o')                     # OpenAI
agent = Agent('claude-3-5-sonnet-20241022') # Anthropic

# Same API regardless of provider
result = await agent.run(prompt)
```

**Benefits**:
- ✅ **Portability**: Change providers with one line
- ✅ **Resilience**: Implement fallback logic
- ✅ **Flexibility**: A/B test providers in production
- ✅ **Cost optimization**: Route queries to the cheapest suitable model

### Real-World Use Cases

1. **Cost optimization**: Use Gemini Flash for simple tasks, GPT-4 for complex reasoning
2. **Failover**: If primary provider rate-limits, fall back to secondary
3. **Regional compliance**: Use different providers based on user location
4. **Quality comparison**: A/B test providers to find best for your use case
5. **Model cascades**: Try small model first, escalate to large model if needed

**Code Location**: [`utils/llms.py`](../../utils/llms.py) defines model constants

---

## Setup Cell

**Task 5.1.1**: Setup cell with imports, cost warning (if testing multiple providers)

In [None]:
# Add project root to path for imports
import sys
sys.path.insert(0, '../..')  # Navigate to composable_app/ root

# Load environment variables
from dotenv import load_dotenv
import os
load_dotenv('../../keys.env')

# Verify API keys
assert os.getenv('GEMINI_API_KEY'), "❌ GEMINI_API_KEY not found in keys.env"
print("✅ Gemini API key loaded")

# Optional: Check for other providers
has_openai = bool(os.getenv('OPENAI_API_KEY'))
has_anthropic = bool(os.getenv('ANTHROPIC_API_KEY'))
print(f"OpenAI API key: {'✅' if has_openai else '❌ (multi-provider examples will be skipped)'}")
print(f"Anthropic API key: {'✅' if has_anthropic else '❌ (multi-provider examples will be skipped)'}")

# Standard library
import asyncio
import time
from typing import List, Dict, Any
from dataclasses import dataclass

# Pydantic AI
from pydantic_ai import Agent
from pydantic_ai.models.gemini import GeminiModelSettings

# Project imports
from agents.article import Article
from utils import llms

print("\n✅ Setup complete")
print("\n⚠️ Cost Warning:")
print("   - Gemini examples: ~$0.05-0.10")
print("   - Multi-provider comparisons: ~$0.20-0.50")
print("   - Skip optional sections to reduce costs")

---

## Provider Switching Basics

**Task 5.1.3**: Code section - Switching from Gemini to OpenAI/Anthropic

### Supported Providers in Pydantic AI

| Provider | Model String Examples | Strengths | Pricing (per 1M tokens) |
|----------|----------------------|-----------|------------------------|
| **Google Gemini** | `gemini-2.0-flash`<br>`gemini-2.5-flash-lite-preview-06-17` | Fast, cheap, multimodal | Input: $0.075<br>Output: $0.30 |
| **OpenAI** | `gpt-4o`<br>`gpt-4o-mini` | High quality, reliable | Input: $2.50<br>Output: $10.00 |
| **Anthropic** | `claude-3-5-sonnet-20241022`<br>`claude-3-5-haiku-20241022` | Long context, reasoning | Input: $3.00<br>Output: $15.00 |
| **Ollama** | `llama3:8b`<br>`mistral:7b` | Free, local, private | $0 (self-hosted) |

**Note**: Prices as of 2025. Check provider websites for latest pricing.

### Basic Provider Switching

In [None]:
# Define a simple task
topic = "Photosynthesis"
prompt = f"Explain {topic} in exactly 2 sentences for a 5th grader."

# Approach 1: Gemini (current default)
async def test_gemini():
    agent = Agent(
        'gemini-2.0-flash',
        system_prompt="You are a K-12 science educator."
    )
    result = await agent.run(prompt)
    return result.data

# Approach 2: OpenAI (if available)
async def test_openai():
    if not has_openai:
        return "⚠️ OPENAI_API_KEY not configured"
    
    agent = Agent(
        'gpt-4o-mini',  # Cheaper variant
        system_prompt="You are a K-12 science educator."
    )
    result = await agent.run(prompt)
    return result.data

# Approach 3: Anthropic (if available)
async def test_anthropic():
    if not has_anthropic:
        return "⚠️ ANTHROPIC_API_KEY not configured"
    
    agent = Agent(
        'claude-3-5-haiku-20241022',  # Fast variant
        system_prompt="You are a K-12 science educator."
    )
    result = await agent.run(prompt)
    return result.data

# Test all providers
print(f"📝 Prompt: '{prompt}'\n")
print("=" * 80)

gemini_response = await test_gemini()
print(f"\n🟢 Gemini 2.0 Flash:")
print(f"   {gemini_response}")

if has_openai:
    openai_response = await test_openai()
    print(f"\n🔵 OpenAI GPT-4o-mini:")
    print(f"   {openai_response}")
else:
    print(f"\n🔵 OpenAI: Skipped (no API key)")

if has_anthropic:
    anthropic_response = await test_anthropic()
    print(f"\n🟣 Anthropic Claude Haiku:")
    print(f"   {anthropic_response}")
else:
    print(f"\n🟣 Anthropic: Skipped (no API key)")

print("\n" + "=" * 80)
print("\n💡 Observation: Same interface, different models - this is provider abstraction!")

---

## Model Configuration in Composable App

**Task 5.1.4**: Code section - Model configuration in utils/llms.py

### Model Constants Strategy

The composable app defines model constants in `utils/llms.py` for centralized management:

```python
# From utils/llms.py:5-8
BEST_MODEL = "gemini-2.0-flash"               # Highest quality
DEFAULT_MODEL = "gemini-2.0-flash"            # Standard use
SMALL_MODEL = "gemini-2.5-flash-lite-preview-06-17"  # Fastest, cheapest
EMBED_MODEL = "text-embedding-004"            # Embeddings
```

### Why Centralize Configuration?

**Benefits**:
1. **Single source of truth**: Change model once, affects entire app
2. **Easy A/B testing**: Swap `BEST_MODEL` to compare providers
3. **Cost control**: Downgrade all agents to `SMALL_MODEL` during development
4. **Semantic naming**: `BEST_MODEL` is clearer than `gemini-2.0-flash` in code

**Usage in agents**:
```python
# From agents/generic_writer_agent.py:98
self.agent = Agent(llms.BEST_MODEL,  # Not hardcoded!
                   output_type=Article,
                   model_settings=llms.default_model_settings())
```

### Model Selection Strategy

| Use Case | Model Choice | Rationale |
|----------|-------------|----------|
| **Initial draft generation** | `BEST_MODEL` | Quality matters, runs once | 
| **Guardrails validation** | `SMALL_MODEL` | Simple yes/no, runs frequently |
| **Keyword extraction** | `DEFAULT_MODEL` | Balanced speed/quality |
| **Review panel (6 agents)** | `DEFAULT_MODEL` | Parallel execution, cost adds up |
| **Revision** | `BEST_MODEL` | Refinement needs quality |

**Cost calculation example**:
```python
# Workflow: 1 draft + 6 reviews + 1 revision (8 calls)
# Simplification: each call emits 1M output tokens, so cost = calls x price per 1M tokens
# Option A: every call at the $0.30 tier
cost_A = 8 * 0.30  # $2.40
# Option B: $0.30 tier for draft/revision, a cheaper $0.15 tier for the 6 reviews
cost_B = 2 * 0.30 + 6 * 0.15  # $1.50, about 37.5% cheaper than Option A
```

In [None]:
# Demo: Comparing BEST_MODEL vs SMALL_MODEL

print(f"📊 Current Model Configuration:\n")
print(f"   BEST_MODEL:    {llms.BEST_MODEL}")
print(f"   DEFAULT_MODEL: {llms.DEFAULT_MODEL}")
print(f"   SMALL_MODEL:   {llms.SMALL_MODEL}")
print(f"   EMBED_MODEL:   {llms.EMBED_MODEL}")

# Test both models on same task
prompt = "List 3 keywords for an article about Photosynthesis."

async def compare_models():
    # BEST_MODEL (high quality)
    best_agent = Agent(llms.BEST_MODEL)
    best_result = await best_agent.run(prompt)
    
    # SMALL_MODEL (high speed)
    small_agent = Agent(llms.SMALL_MODEL)
    small_result = await small_agent.run(prompt)
    
    return best_result.data, small_result.data

best_keywords, small_keywords = await compare_models()

print(f"\n📝 Prompt: '{prompt}'\n")
print(f"🏆 BEST_MODEL ({llms.BEST_MODEL}):")
print(f"   {best_keywords}")
print(f"\n⚡ SMALL_MODEL ({llms.SMALL_MODEL}):")
print(f"   {small_keywords}")

print("\n💡 When to use each:")
print("   - BEST_MODEL: User-facing outputs, complex reasoning, final drafts")
print("   - SMALL_MODEL: Internal tasks, simple classification, guardrails")
print("   - DEFAULT_MODEL: General purpose, balanced cost/quality")

---

## Provider-Specific Settings

**Task 5.1.5**: Code section - Provider-specific settings (temperature, safety)

### Model Settings in Pydantic AI

Each provider has unique configuration options:

#### Gemini Settings
```python
# From utils/llms.py:21-32
from pydantic_ai.models.gemini import GeminiModelSettings

model_settings = GeminiModelSettings(
    temperature=0.25,  # Low temp for consistency (0.0-2.0)
    gemini_safety_settings=[
        {
            'category': 'HARM_CATEGORY_DANGEROUS_CONTENT',
            'threshold': 'BLOCK_ONLY_HIGH',  # Permissive for educational content
        }
    ]
)
```

**Why temperature=0.25?**
- K-12 content needs **consistency** (same explanation each time)
- Too low (0.0): Repetitive, robotic
- Too high (1.0+): Creative but inconsistent, may hallucinate
- Sweet spot (0.2-0.3): Consistent but natural-sounding
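
To make the temperature intuition above concrete, here is a minimal sketch of temperature-scaled softmax over a toy set of next-token scores. The scores and token names are invented for illustration, and real providers layer further sampling controls (such as top-p or top-k) on top, so treat this as a model of the idea and not of any provider's implementation.

```python
import math
from typing import Dict

def softmax_with_temperature(logits: Dict[str, float], temperature: float) -> Dict[str, float]:
    """Turn raw scores into probabilities; low temperature sharpens, high flattens."""
    if temperature <= 0:
        # Treat temperature 0 as greedy decoding: all mass on the top-scoring token.
        best = max(logits, key=logits.get)
        return {tok: 1.0 if tok == best else 0.0 for tok in logits}
    scaled = {tok: score / temperature for tok, score in logits.items()}
    peak = max(scaled.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(s - peak) for tok, s in scaled.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

# Hypothetical scores for the next word after "Plants make food using ..."
toy_logits = {"sunlight": 2.0, "light": 1.5, "magic": 0.2}

for t in (0.0, 0.25, 1.0):
    probs = softmax_with_temperature(toy_logits, t)
    print(t, {tok: round(p, 3) for tok, p in probs.items()})
```

At 0.25 most of the probability mass sits on the top-scoring token, while at 1.0 the unlikely token becomes a realistic pick, which is the consistency-versus-variety trade-off described above.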

**Why permissive safety settings?**
- Educational content discusses topics like "cell death", "nuclear reactions"
- Default settings may over-block scientific terms
- `BLOCK_ONLY_HIGH`: Allows educational content, blocks actual harmful content

#### OpenAI Settings
```python
from pydantic_ai.models.openai import OpenAIModelSettings

openai_settings = OpenAIModelSettings(
    temperature=0.25,
    max_tokens=2000,      # Control output length
    top_p=0.9,            # Nucleus sampling (alternative to temperature)
    frequency_penalty=0.0, # Reduce repetition (0.0-2.0)
    presence_penalty=0.0   # Encourage new topics (0.0-2.0)
)
```

#### Anthropic Settings
```python
from pydantic_ai.models.anthropic import AnthropicModelSettings

anthropic_settings = AnthropicModelSettings(
    temperature=0.25,
    max_tokens=2000,
    top_p=0.9,
    top_k=50              # Limits sampling pool (Claude-specific)
)
```

In [None]:
# Demo: Temperature effects

async def test_temperature(temperature: float, num_runs: int = 3):
    """Test how temperature affects consistency."""
    agent = Agent(
        llms.DEFAULT_MODEL,
        model_settings=GeminiModelSettings(temperature=temperature)
    )
    
    prompt = "Name 1 example of photosynthesis in nature."
    
    responses = []
    for i in range(num_runs):
        result = await agent.run(prompt)
        responses.append(result.data)
    
    return responses

print("🧪 Testing temperature effects on consistency:\n")
print(f"Prompt: 'Name 1 example of photosynthesis in nature.'")
print(f"Running 3 times with different temperatures...\n")

# Test low temperature (consistent)
low_temp_responses = await test_temperature(0.0)
print("❄️ Temperature = 0.0 (most deterministic):")
for i, resp in enumerate(low_temp_responses, 1):
    print(f"   Run {i}: {resp[:80]}...")

# Test medium temperature (balanced)
med_temp_responses = await test_temperature(0.25)
print("\n🌡️ Temperature = 0.25 (composable app default):")
for i, resp in enumerate(med_temp_responses, 1):
    print(f"   Run {i}: {resp[:80]}...")

# Test high temperature (creative)
high_temp_responses = await test_temperature(1.0)
print("\n🔥 Temperature = 1.0 (most creative):")
for i, resp in enumerate(high_temp_responses, 1):
    print(f"   Run {i}: {resp[:80]}...")

print("\n💡 Observation:")
print("   - Low temp (0.0): Almost identical responses (good for factual content)")
print("   - Medium temp (0.25): Slight variation (balanced for educational content)")
print("   - High temp (1.0): Diverse responses (good for creative tasks)")

print("\n⚠️ Recommendation: Use 0.2-0.3 for K-12 content (consistency with natural variation)")

In [None]:
# Demo: Sequential vs. Parallel execution

import time

async def simulate_llm_call(agent_name: str, delay: float = 2.0):
    """Simulate LLM API call with delay."""
    start = time.time()
    await asyncio.sleep(delay)  # Simulates API latency
    elapsed = time.time() - start
    return f"{agent_name} completed in {elapsed:.2f}s"

# Sequential execution (slow)
async def sequential_execution():
    """Run 6 agents one after another."""
    agents = [f"Reviewer{i+1}" for i in range(6)]
    
    start_time = time.time()
    results = []
    
    for agent in agents:
        result = await simulate_llm_call(agent, delay=0.5)  # 0.5s each (faster for demo)
        results.append(result)
    
    total_time = time.time() - start_time
    return results, total_time

# Parallel execution (fast)
async def parallel_execution():
    """Run 6 agents simultaneously with asyncio.gather()."""
    agents = [f"Reviewer{i+1}" for i in range(6)]
    
    start_time = time.time()
    
    # Create all tasks
    tasks = [simulate_llm_call(agent, delay=0.5) for agent in agents]
    
    # Execute in parallel
    results = await asyncio.gather(*tasks)
    
    total_time = time.time() - start_time
    return results, total_time

# Run comparison
print("🐌 Sequential Execution:\n")
seq_results, seq_time = await sequential_execution()
for result in seq_results:
    print(f"   {result}")
print(f"\n   Total time: {seq_time:.2f}s\n")

print("=" * 80)

print("\n⚡ Parallel Execution with asyncio.gather():\n")
par_results, par_time = await parallel_execution()
for result in par_results:
    print(f"   {result}")
print(f"\n   Total time: {par_time:.2f}s\n")

print("=" * 80)

print(f"\n📊 Performance Comparison:")
print(f"   Sequential: {seq_time:.2f}s")
print(f"   Parallel:   {par_time:.2f}s")
print(f"   Speedup:    {seq_time / par_time:.1f}x faster! ⚡")

print("\n💡 Real-world application:")
print("   - ReviewerPanel: 6 reviewers in parallel (12s → 2s)")
print("   - Guardrails: Multiple checks in parallel (3s → 1s)")
print("   - A/B testing: Test multiple prompts simultaneously")

---

## Parallel Execution with asyncio.gather()

**Task 5.1.7**: Section - Parallel execution patterns with asyncio.gather()

### The Latency Problem

**Sequential execution** (slow):
```python
# Bad: 6 reviewers run one after another
reviews = []
for reviewer in reviewers:  # 6 iterations
    review = await reviewer.review(article)  # 2 seconds each
    reviews.append(review)
# Total time: 6 × 2s = 12 seconds
```

**Parallel execution** (fast):
```python
# Good: All 6 reviewers run simultaneously
tasks = [reviewer.review(article) for reviewer in reviewers]
reviews = await asyncio.gather(*tasks)
# Total time: max(2s, 2s, 2s, 2s, 2s, 2s) = 2 seconds
```

**Speedup**: 6× faster!

### How asyncio.gather() Works

```python
async def task1():
    await asyncio.sleep(2)
    return "Task 1 done"

async def task2():
    await asyncio.sleep(2)
    return "Task 2 done"

# Sequential (4 seconds total)
result1 = await task1()  # Wait 2s
result2 = await task2()  # Wait 2s more

# Parallel (2 seconds total)
result1, result2 = await asyncio.gather(task1(), task2())  # Wait 2s total
```

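**Key insight**: `asyncio.gather()` provides concurrency on a single event loop, not multi-core parallelism. While one coroutine waits for an API response, the event loop runs the others, which is why I/O-bound LLM calls overlap so well.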
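
One detail the examples above leave out is what happens when one of the concurrent calls fails. By default `asyncio.gather()` propagates the first exception to the caller; with `return_exceptions=True` it returns exceptions in place of results, so the successful results stay usable. A minimal sketch, using a hypothetical `fake_review` coroutine in place of a real LLM call:

```python
import asyncio

async def fake_review(name: str, delay: float, fail: bool = False) -> str:
    """Stand-in for an LLM call; sleeps instead of hitting an API."""
    await asyncio.sleep(delay)
    if fail:
        raise RuntimeError(f"{name} hit a rate limit")
    return f"{name}: looks good"

async def run_panel() -> tuple:
    results = await asyncio.gather(
        fake_review("grammar", 0.05),
        fake_review("math", 0.05, fail=True),
        fake_review("admin", 0.05),
        return_exceptions=True,  # exceptions are returned in the result list, not raised
    )
    # Results keep the order of the arguments, so they can be split afterwards.
    reviews = [r for r in results if not isinstance(r, Exception)]
    errors = [r for r in results if isinstance(r, Exception)]
    return reviews, errors

reviews, errors = asyncio.run(run_panel())  # in a notebook: await run_panel()
print(reviews)
print(errors)
```

Whether a partial set of reviews is acceptable is a design decision; a panel that needs every persona may prefer the default fail-fast behavior.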

### Composable App Example: ReviewerPanel

```python
# From agents/reviewer_panel.py (simplified)
async def review_article(self, article: Article) -> str:
    # Create 6 review tasks (all different personas)
    tasks = [
        self.grammar_reviewer.review(article),
        self.math_reviewer.review(article),
        self.conservative_parent.review(article),
        self.liberal_parent.review(article),
        self.school_admin.review(article),
        self.district_rep.review(article),
    ]
    
    # Execute all 6 in parallel
    reviews = await asyncio.gather(*tasks)
    
    # Consolidate feedback
    consolidated = await self.secretary.consolidate(reviews)
    return consolidated
```

**Performance**: 2-3 seconds (parallel) vs. 12-18 seconds (sequential)

---

## Cost Comparison Across Providers

**Task 5.1.6**: Section - Cost comparison across providers (table with pricing)

### Pricing Breakdown (January 2025)

**Note**: Prices change frequently. Always check provider websites for latest rates.

| Provider | Model | Input (per 1M tokens) | Output (per 1M tokens) | Context Window | Speed |
|----------|-------|----------------------|------------------------|----------------|-------|
| **Gemini** | `gemini-2.0-flash` | $0.075 | $0.30 | 1M tokens | Very Fast |
| | `gemini-2.5-flash-lite` | $0.04 | $0.15 | 128K tokens | Fastest |
| **OpenAI** | `gpt-4o` | $2.50 | $10.00 | 128K tokens | Fast |
| | `gpt-4o-mini` | $0.15 | $0.60 | 128K tokens | Fast |
| **Anthropic** | `claude-3-5-sonnet` | $3.00 | $15.00 | 200K tokens | Medium |
| | `claude-3-5-haiku` | $0.80 | $4.00 | 200K tokens | Fast |
| **Ollama** | `llama3:8b` (local) | $0.00 | $0.00 | 8K tokens | Slow (CPU) |

### Real-World Cost Analysis: Composable App Workflow

**Scenario**: Generate 1 article
- TaskAssigner: 100 tokens in, 50 tokens out
- Writer (initial draft): 500 tokens in, 1000 tokens out
- ReviewerPanel (6 reviewers): 1500 tokens in × 6, 300 tokens out × 6
- Writer (revision): 2500 tokens in, 1200 tokens out

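**Total tokens**: ~12.1K input and ~4.05K output by the breakdown above. The cost tables below use approximate working figures of 13K input and 3.8K output.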

#### Cost per Article by Provider

| Provider | Model | Input Cost | Output Cost | **Total** |
|----------|-------|-----------|-------------|----------|
| **Gemini** | `gemini-2.0-flash` | $0.00098 | $0.00114 | **$0.0021** |
| **OpenAI** | `gpt-4o-mini` | $0.00195 | $0.00228 | **$0.0042** |
| **OpenAI** | `gpt-4o` | $0.0325 | $0.038 | **$0.0705** |
| **Anthropic** | `claude-3-5-haiku` | $0.0104 | $0.0152 | **$0.0256** |
| **Anthropic** | `claude-3-5-sonnet` | $0.039 | $0.057 | **$0.096** |
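
The per-article figures in the table come from multiplying token counts by per-million-token prices. A small sketch that reproduces the arithmetic, using the 13K input / 3.8K output figures and the prices listed in this section (the `PRICES` dictionary and the function name are illustrative, not part of the composable app):

```python
from typing import Dict, Tuple

# (input $, output $) per 1M tokens, copied from the pricing table above
PRICES: Dict[str, Tuple[float, float]] = {
    "gemini-2.0-flash": (0.075, 0.30),
    "gpt-4o-mini": (0.15, 0.60),
    "gpt-4o": (2.50, 10.00),
    "claude-3-5-haiku": (0.80, 4.00),
    "claude-3-5-sonnet": (3.00, 15.00),
}

def cost_per_article(model: str, input_tokens: int = 13_000, output_tokens: int = 3_800) -> float:
    """Return the estimated USD cost of one article for the given model."""
    input_price, output_price = PRICES[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000

for model in PRICES:
    print(f"{model:20s} ${cost_per_article(model):.4f}")
```

Passing different token counts to `cost_per_article` shows how sensitive the total is to the reviewer panel, which contributes most of the input tokens.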

### Cost Optimization Strategies

#### 1. **Model Cascades** (Pattern 8)
Try cheap model first, escalate if needed:
```python
# Attempt 1: Try SMALL_MODEL (fast, cheap)
result = await small_agent.run(prompt)

# Attempt 2: If quality insufficient, escalate to BEST_MODEL
if quality_check(result) < 0.7:
    result = await best_agent.run(prompt)
```

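**Savings**: 70-90% when the small model succeeds on most requests and the price gap between tiers is large. Every escalated request pays for both calls, so the savings shrink as the escalation rate rises.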
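
The snippet above leaves `quality_check` undefined. A self-contained sketch of the cascade control flow follows, with stub coroutines standing in for the two agents and a deliberately naive length-based `quality_check`. All three are placeholders; a real check might use a guardrail agent or schema validation instead.

```python
import asyncio
from typing import Awaitable, Callable, Tuple

ModelCall = Callable[[str], Awaitable[str]]

async def small_model(prompt: str) -> str:
    """Stub for a cheap model: gives a very short answer."""
    return "Plants eat light."

async def best_model(prompt: str) -> str:
    """Stub for an expensive model: gives a fuller answer."""
    return "Plants use sunlight, water, and carbon dioxide to make sugar and release oxygen."

def quality_check(answer: str) -> float:
    """Naive placeholder score in [0, 1]: longer answers score higher."""
    return min(len(answer.split()) / 10, 1.0)

async def cascade(prompt: str, cheap: ModelCall, strong: ModelCall,
                  threshold: float = 0.7) -> Tuple[str, str]:
    """Try the cheap model first; escalate only if its score is below the threshold."""
    answer = await cheap(prompt)
    if quality_check(answer) >= threshold:
        return answer, "small"
    return await strong(prompt), "best"

# In a notebook, use: answer, tier = await cascade(...)
answer, tier = asyncio.run(cascade("Explain photosynthesis.", small_model, best_model))
print(tier, "->", answer)
```

Returning the tier alongside the answer makes it easy to log the escalation rate, which determines how much the cascade actually saves.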

#### 2. **Smart Routing**
Route tasks to appropriate model tier:
```python
# Simple tasks → SMALL_MODEL
if task.type == "guardrail_check":
    agent = Agent(llms.SMALL_MODEL)

# Complex tasks → BEST_MODEL
elif task.type == "initial_draft":
    agent = Agent(llms.BEST_MODEL)
```
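
An `if`/`elif` chain grows awkward as task types multiply. One alternative is a lookup table from task type to model tier, sketched here with the model strings from `utils/llms.py` inlined as constants so the example runs on its own. The task-type names and the `pick_model` helper are hypothetical.

```python
# Model constants mirrored from utils/llms.py so this sketch is standalone
BEST_MODEL = "gemini-2.0-flash"
DEFAULT_MODEL = "gemini-2.0-flash"
SMALL_MODEL = "gemini-2.5-flash-lite-preview-06-17"

# Hypothetical task types mapped to tiers, following the model selection table
MODEL_BY_TASK = {
    "guardrail_check": SMALL_MODEL,
    "keyword_extraction": DEFAULT_MODEL,
    "review": DEFAULT_MODEL,
    "initial_draft": BEST_MODEL,
    "revision": BEST_MODEL,
}

def pick_model(task_type: str) -> str:
    """Return the model for a task type, falling back to DEFAULT_MODEL."""
    return MODEL_BY_TASK.get(task_type, DEFAULT_MODEL)

print(pick_model("guardrail_check"))
print(pick_model("unknown_task"))
```

Keeping the mapping in one place means a tier change for, say, reviews is a one-line edit.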

#### 3. **Prompt Caching** (Pattern 25)
Reuse expensive context across calls:
```python
# Large context (e.g., book chapter) cached by provider
# First call: Pay for full context
# Subsequent calls: Only pay for new content

# Supported by: Anthropic Claude (90% cache discount), Google Gemini (context caching API)
```

#### 4. **Batch Processing**
Some providers offer batch API (slower, 50% cheaper):
```python
# OpenAI Batch API: 50% discount, 24hr turnaround
# Good for: Evaluation datasets, overnight processing
# Bad for: Real-time user-facing workflows
```

### When to Optimize for Cost vs. Quality

**Optimize for cost**:
- ✅ Internal tools and automation
- ✅ High-volume simple tasks (guardrails, classification)
- ✅ Development and testing

**Optimize for quality**:
- ✅ User-facing outputs
- ✅ High-stakes decisions (medical, legal, financial)
- ✅ Complex reasoning tasks

**Balance both**:
- ✅ Educational content (composable app)
- ✅ Most production applications
- ✅ Use model cascades for best of both worlds

---

## Next Steps

### Continue Learning
1. **[Evaluation Tutorial](evaluation_tutorial.ipynb)** - Measure quality and iterate
2. **[Horizontal Services](../concepts/horizontal_services.md)** - Composable design patterns
3. **[Multi-Agent Pattern](multi_agent_pattern.ipynb)** - ReviewerPanel deep-dive

### Hands-On Practice
1. **Provider comparison**: Test all 3 providers (Gemini, OpenAI, Anthropic) on same task
2. **Cost optimization**: Implement model cascade (SMALL_MODEL → BEST_MODEL)
3. **Parallel execution**: Convert a sequential workflow to parallel with asyncio.gather()
4. **Temperature tuning**: Find optimal temperature for your use case

### Advanced Challenges
1. **Implement failover logic**: Automatic provider switching on errors
2. **Cost tracking**: Log API costs per provider, visualize with Streamlit
3. **A/B testing framework**: Compare providers systematically with metrics
4. **Dynamic routing**: Route queries to providers based on complexity

### Production Considerations
1. **Monitoring**: Track latency, error rates, costs per provider
2. **Rate limiting**: Implement exponential backoff for API errors
3. **Caching**: Use provider-specific caching (Anthropic prompt caching, Gemini context caching)
4. **Security**: Store API keys in secrets manager, not .env files

---

**Congratulations!** You've learned advanced patterns for LLM optimization:
- Provider abstraction with Pydantic AI
- Cost optimization strategies (model cascades, smart routing)
- Performance optimization (parallel execution, temperature tuning)
- Resilience patterns (failover, retry logic)

**Tutorial Version**: 1.0  
**Last Updated**: 2025-11-05  
**Estimated Time**: 40-45 minutes  
**API Cost**: $0.05-0.50 (depending on providers tested)

---

## Self-Assessment

**Task 5.1.13**: Self-assessment questions with answers

### Question 1: Concept Check
**What is the main benefit of Pydantic AI's provider abstraction?**

<details>
<summary>Click to reveal answer</summary>

**Answer**: **Portability and flexibility** - Change LLM providers by modifying a single string, without rewriting code.

**Example**:
```python
# Switch from Gemini to OpenAI - same API
agent = Agent('gemini-2.0-flash')        # Google
agent = Agent('gpt-4o')                   # OpenAI  
agent = Agent('claude-3-5-sonnet-20241022')  # Anthropic
```

**Benefits**:
1. **No vendor lock-in**: Easy migration between providers
2. **Resilience**: Implement fallback logic if primary provider fails
3. **Cost optimization**: Route tasks to cheapest suitable model
4. **A/B testing**: Compare providers in production

**Real-world scenario**: If OpenAI raises prices, switch entire app to Gemini in minutes, not weeks.
</details>

---

### Question 2: Implementation
**Why does the composable app use `temperature=0.25` for educational content?**

<details>
<summary>Click to reveal answer</summary>

**Answer**: **Consistency** - Low temperature ensures similar explanations for same topic.

**Temperature effects**:
- **0.0**: Deterministic, almost identical outputs (too robotic)
- **0.2-0.3**: Consistent with natural variation (ideal for education)
- **0.7-1.0**: Creative, diverse outputs (good for brainstorming, bad for facts)
- **1.5+**: Chaotic, may hallucinate (avoid for production)

**Why it matters for K-12**:
- Students asking same question should get consistent answer
- Factual accuracy more important than creativity
- Parents/teachers expect predictable, reliable content

**Exception**: Use higher temperature (0.7-0.9) for creative writing assignments, poetry, or open-ended exploration.
</details>

---

### Question 3: Optimization
**You have a workflow with 6 parallel LLM calls (ReviewerPanel). Each call takes 2 seconds. What's the total time for sequential vs. parallel execution?**

<details>
<summary>Click to reveal answer</summary>

**Answer**:
- **Sequential**: 6 × 2s = **12 seconds**
- **Parallel** (with asyncio.gather): max(2s, 2s, 2s, 2s, 2s, 2s) = **2 seconds**
- **Speedup**: 6× faster ⚡

**Code**:
```python
# Sequential (slow)
reviews = []
for reviewer in reviewers:
    review = await reviewer.review(article)  # Wait for each
    reviews.append(review)
# Total: 12s

# Parallel (fast)
tasks = [reviewer.review(article) for reviewer in reviewers]
reviews = await asyncio.gather(*tasks)  # All at once
# Total: 2s
```

**Why it works**: While waiting for API response from Reviewer1, Python starts requests for Reviewer2-6 simultaneously.

**Limitation**: Speedup limited by slowest task (if one reviewer takes 5s, total = 5s, not 2s).
</details>

---

### Question 4: Cost Analysis
**Scenario**: Generate 1000 articles using composable app workflow (13K input tokens, 3.8K output tokens per article). Compare costs for Gemini Flash vs. GPT-4o.

<details>
<summary>Click to reveal answer</summary>

**Calculation**:

**Gemini 2.0 Flash**:
- Input: 1000 × 13K tokens × $0.075/1M = $0.975
- Output: 1000 × 3.8K tokens × $0.30/1M = $1.14
- **Total**: **$2.12**

**GPT-4o**:
- Input: 1000 × 13K tokens × $2.50/1M = $32.50
- Output: 1000 × 3.8K tokens × $10.00/1M = $38.00
- **Total**: **$70.50**

**Analysis**:
- GPT-4o is **33× more expensive** ($70.50 vs $2.12)
- For educational content, Gemini provides good quality at fraction of cost
- **Recommendation**: Use Gemini Flash for development/testing, evaluate GPT-4o only if quality insufficient

**Optimization**: Use model cascades
```python
# Try Gemini first (cheap)
result = await gemini_agent.run(prompt)

# Escalate to GPT-4 only if quality check fails
if quality_score(result) < 0.8:
    result = await gpt4_agent.run(prompt)
```

**Savings**: 70-90% when Gemini succeeds most of the time.
</details>

---

### Question 5: Advanced
**How would you implement automatic failover if the primary LLM provider is down?**

<details>
<summary>Click to reveal answer</summary>

**Approach: Try-Except with Provider Fallback**

```python
import asyncio
import logging

from pydantic_ai import Agent

logger = logging.getLogger(__name__)

async def resilient_llm_call(prompt: str, max_retries: int = 2):
    """Call LLM with automatic failover to backup provider."""

    # Provider priority list
    providers = [
        ('gemini-2.0-flash', 'primary'),
        ('gpt-4o-mini', 'secondary'),
        ('claude-3-5-haiku-20241022', 'tertiary')
    ]

    last_error = None

    for model, tier in providers:
        for attempt in range(max_retries):
            try:
                agent = Agent(model)
                result = await agent.run(prompt)

                logger.info(f"✅ Success with {tier} provider: {model}")
                return result.data

            except Exception as e:
                last_error = e
                logger.warning(f"⚠️ {tier} provider failed (attempt {attempt+1}): {e}")

                # Exponential backoff before the next attempt on the same provider
                if attempt < max_retries - 1:
                    await asyncio.sleep(2 ** attempt)

    # All providers failed
    logger.error(f"❌ All providers failed. Last error: {last_error}")
    raise RuntimeError(f"LLM call failed after trying all providers: {last_error}") from last_error

# Usage
result = await resilient_llm_call("Explain photosynthesis")
```

**Key features**:
1. **Priority list**: Try cheapest/fastest first
2. **Retry logic**: 2 attempts per provider
3. **Exponential backoff**: Wait `2 ** attempt` seconds between retries on the same provider (1s before the second attempt with the default `max_retries=2`)
4. **Logging**: Track which provider succeeded for monitoring
5. **Graceful degradation**: Falls back to slower/more expensive providers

**Production enhancements**:
- **Circuit breaker**: Temporarily skip failing providers
- **Health checks**: Monitor provider uptime separately
- **Cost tracking**: Log which provider was used for billing
- **Quality monitoring**: Compare output quality across providers
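
As one illustration of the circuit-breaker idea, the sketch below skips a provider for a cooldown period after repeated failures. The class name, thresholds, and injectable clock are assumptions made for this example and are not part of Pydantic AI.

```python
import time
from typing import Callable, Dict

class CircuitBreaker:
    """Skip a provider for `cooldown` seconds after `max_failures` consecutive failures."""

    def __init__(self, max_failures: int = 3, cooldown: float = 30.0,
                 clock: Callable[[], float] = time.monotonic):
        self.max_failures = max_failures
        self.cooldown = cooldown
        self.clock = clock  # injectable so the behavior can be tested without waiting
        self.failures: Dict[str, int] = {}
        self.opened_at: Dict[str, float] = {}

    def allow(self, provider: str) -> bool:
        """Return True if the provider may be called right now."""
        opened = self.opened_at.get(provider)
        if opened is None:
            return True
        if self.clock() - opened >= self.cooldown:
            # Cooldown elapsed: close the circuit and give the provider another chance.
            del self.opened_at[provider]
            self.failures[provider] = 0
            return True
        return False

    def record_failure(self, provider: str) -> None:
        """Count a failure and open the circuit once the limit is reached."""
        self.failures[provider] = self.failures.get(provider, 0) + 1
        if self.failures[provider] >= self.max_failures:
            self.opened_at[provider] = self.clock()

    def record_success(self, provider: str) -> None:
        """Reset the consecutive-failure count after a successful call."""
        self.failures[provider] = 0
```

In a failover loop, `allow(model)` would be checked before trying a provider, with `record_failure` or `record_success` called once the outcome is known.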
</details>

---

## Common Pitfalls

**Task 5.1.12**: Common Pitfalls section

### ❌ Error: "Model not found" when switching providers

**Cause**: Typo in model name or using provider-specific model with wrong API key

**Example**:
```python
# Wrong: Gemini model name with OpenAI key
os.environ['OPENAI_API_KEY'] = "..."
agent = Agent('gemini-2.0-flash')  # Will fail!
```

**Solution**: Verify model name matches provider
```python
# Correct: Use OpenAI model with OpenAI key
agent = Agent('gpt-4o')  # ✅
```

---

### ❌ Error: "TypeError: An asyncio.Future, a coroutine or an awaitable is required"

**Cause**: Passing the task list itself to `asyncio.gather()` instead of unpacking it

**Example**:
```python
# Wrong
tasks = [task1(), task2()]
results = await asyncio.gather(tasks)  # Missing *
```

**Solution**: Use * to unpack list
```python
# Correct
tasks = [task1(), task2()]
results = await asyncio.gather(*tasks)  # ✅
```

---

### ⚠️ Warning: High costs from accidental BEST_MODEL usage

**Cause**: Using expensive model for simple tasks

**Example**:
```python
# Expensive: GPT-4o for a simple yes/no check
agent = Agent('gpt-4o')  # $10/1M output tokens
for item in items:  # e.g. a list of 10,000 items
    result = await agent.run(f"Is this spam? {item}")
# Cost: ~$100 if each call averages ~1,000 output tokens (10M tokens total)
```

**Solution**: Use tiered model selection
```python
# Cheap: Use small model for classification
agent = Agent(llms.SMALL_MODEL)  # $0.15/1M output tokens
# Cost: ~$1.50 for same task
```

---

### ⚠️ Warning: Temperature too high causes inconsistent outputs

**Cause**: Temperature > 0.5 for factual content

**Example**:
```python
# Bad: High temperature for educational content
settings = GeminiModelSettings(temperature=1.5)
# Result: Same question gives wildly different answers
```

**Solution**: Use low temperature (0.2-0.3) for consistency
```python
# Good: Low temperature for consistent educational content
settings = GeminiModelSettings(temperature=0.25)
```

---

### 💡 Tip: Cache agent instances for better performance

**Problem**: Creating new Agent() for every request is slow

**Bad**:
```python
for topic in topics:
    agent = Agent(llms.BEST_MODEL)  # Recreated each time!
    result = await agent.run(f"Write about {topic}")
```

**Better**:
```python
# Create once, reuse many times
agent = Agent(llms.BEST_MODEL)
for topic in topics:
    result = await agent.run(f"Write about {topic}")  # ✅ Faster
```

---

### 💡 Tip: Use environment variables for API keys, not hardcoded strings

**Bad**:
```python
os.environ['GEMINI_API_KEY'] = "AIza..."  # Hardcoded! Security risk
```

**Good**:
```python
# Store in keys.env file (git-ignored)
load_dotenv('keys.env')
# Keys loaded from file, not in code ✅
```

---

### 💡 Tip: Monitor API quotas and rate limits

**Problem**: Hitting rate limits during high-volume processing

**Solutions**:
1. **Exponential backoff**: Retry with increasing delays
2. **Rate limiting**: Add `asyncio.sleep()` between batches
3. **Provider diversity**: Distribute load across multiple providers

```python
# Example: Rate limiting
for batch in chunks(items, batch_size=100):
    results = await asyncio.gather(*[process(item) for item in batch])
    await asyncio.sleep(1)  # Wait 1s between batches
```
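
The snippet above assumes a `chunks` helper and a `process` coroutine that this tutorial does not define. A runnable sketch with both filled in as placeholders follows; the `process` stub sleeps instead of calling an API, and the pause is shortened for demonstration.

```python
import asyncio
from typing import Iterator, List, Sequence

def chunks(items: Sequence[str], batch_size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

async def process(item: str) -> str:
    """Placeholder for an LLM call."""
    await asyncio.sleep(0.01)
    return item.upper()

async def process_in_batches(items: Sequence[str], batch_size: int, pause: float) -> List[str]:
    """Run each batch concurrently, pausing between batches to stay under rate limits."""
    results: List[str] = []
    for batch in chunks(items, batch_size):
        results.extend(await asyncio.gather(*(process(item) for item in batch)))
        await asyncio.sleep(pause)
    return results

# In a notebook, use: out = await process_in_batches(...)
out = asyncio.run(process_in_batches(["a", "b", "c", "d", "e"], batch_size=2, pause=0.01))
print(out)
```

A fixed pause is the simplest throttle; an `asyncio.Semaphore` that caps in-flight requests is a common alternative when batches vary in duration.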

In [None]:
# Demo: Profile real LLM calls

import time
from typing import Dict

async def profile_model(model_name: str, prompt: str) -> Dict[str, float]:
    """Profile a single model call."""
    agent = Agent(model_name)
    
    start_time = time.time()
    result = await agent.run(prompt)
    end_time = time.time()
    
    latency = end_time - start_time
    return {
        "model": model_name,
        "latency_seconds": latency,
        "response_length": len(result.data)
    }

# Test different models
prompt = "Explain photosynthesis in 1 sentence."

print("⏱️ Profiling Model Performance:\n")
print(f"Prompt: '{prompt}'\n")
print("=" * 80)

# Profile BEST_MODEL
best_profile = await profile_model(llms.BEST_MODEL, prompt)
print(f"\n🏆 {best_profile['model']}:")
print(f"   Latency: {best_profile['latency_seconds']:.2f}s")
print(f"   Response length: {best_profile['response_length']} chars")

# Profile SMALL_MODEL
small_profile = await profile_model(llms.SMALL_MODEL, prompt)
print(f"\n⚡ {small_profile['model']}:")
print(f"   Latency: {small_profile['latency_seconds']:.2f}s")
print(f"   Response length: {small_profile['response_length']} chars")

print("\n" + "=" * 80)

# Compare
speedup = best_profile['latency_seconds'] / small_profile['latency_seconds']
print(f"\n📊 Analysis:")
print(f"   SMALL_MODEL is {speedup:.1f}x faster")
print(f"   Trade-off: Speed vs. quality (both produce valid answers)")

print("\n💡 Recommendation:")
print("   - Use SMALL_MODEL for time-sensitive tasks (guardrails, classification)")
print("   - Use BEST_MODEL for quality-critical tasks (user-facing content)")

---

## Performance Profiling

**Task 5.1.8**: Code section - Performance profiling with %%time magic

### Measuring Latency

**Why profile?**
- Identify bottlenecks in multi-agent workflows
- Compare provider performance
- Validate parallel execution benefits

**Tools**:
1. **Time module**: `time.time()` for programmatic timing
2. **Jupyter %%time**: Cell magic for quick profiling
3. **Profilers**: cProfile, py-spy for deep analysis
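
For programmatic timing, a small context manager built on `time.perf_counter()` keeps measurement code out of the way. This is a generic sketch: the `timer` helper and its `timings` dictionary are not part of the composable app, and `asyncio.sleep` stands in for a real model call.

```python
import asyncio
import time
from contextlib import contextmanager
from typing import Dict, Iterator

timings: Dict[str, float] = {}

@contextmanager
def timer(label: str) -> Iterator[None]:
    """Record wall-clock seconds spent inside the with-block under `label`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[label] = time.perf_counter() - start

async def main() -> None:
    with timer("sequential"):
        await asyncio.sleep(0.05)
        await asyncio.sleep(0.05)
    with timer("parallel"):
        await asyncio.gather(asyncio.sleep(0.05), asyncio.sleep(0.05))

asyncio.run(main())  # in a notebook: await main()
for label, seconds in timings.items():
    print(f"{label}: {seconds:.3f}s")
```

`time.perf_counter()` is preferred over `time.time()` for measuring intervals because it is monotonic and has higher resolution.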