# Chain-of-Agents Part 4: PPO - The 18-20% Performance Breakthrough

**Time**: 60 minutes | **Level**: Advanced | **Prerequisite**: Parts 1-3

## The Final Frontier 🚀

Your AFM from SFT achieves 45% on GAIA. Good, but not great.

**With PPO (Proximal Policy Optimization)**: 55.3% on GAIA! 🎯

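That's a relative gain of (55.3 − 45) / 45 ≈ 23% with these rounded figures, slightly above the **18-20% relative improvement** in the headline. This RL stage is what lifts CoA past the traditional multi-agent baseline compared in Step 7.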

Let's build it from first principles.

## Step 1: Why Does PPO Help AFMs?

Let's understand the problem SFT leaves unsolved.

In [None]:
def demonstrate_sft_limitations():
    """Show why SFT alone isn't enough for AFM"""
    
    print("🤔 SFT LIMITATIONS FOR AFM")
    print("="*50)
    
    problems = [
        {
            "issue": "Imitation Learning Ceiling",
            "description": "AFM can only be as good as training trajectories",
            "example": "If all trajectories score 70%, AFM caps at 70%"
        },
        {
            "issue": "No Exploration",
            "description": "SFT just copies, never discovers new strategies",
            "example": "Never learns better agent combinations"
        },
        {
            "issue": "Reward Misalignment",
            "description": "Optimizes for copying, not for actual task success",
            "example": "Perfect imitation ≠ perfect performance"
        },
        {
            "issue": "Distribution Mismatch",
            "description": "Training data ≠ real world deployment",
            "example": "Works on seen tasks, fails on new ones"
        }
    ]
    
    for i, problem in enumerate(problems, 1):
        print(f"\n{i}. {problem['issue']}")
        print(f"   Problem: {problem['description']}")
        print(f"   Example: {problem['example']}")
    
    print("\n💡 THE SOLUTION: Reinforcement Learning")
    print("   Let the AFM explore and improve beyond training data!")
    print("   Reward actual task performance, not just imitation")

demonstrate_sft_limitations()
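
The imitation ceiling can be made concrete with a toy example. The sketch below assumes three hypothetical strategies with made-up success rates and a made-up demonstration distribution; none of these numbers come from the CoA experiments. Perfect imitation inherits the score of the demonstrations, while exact policy-gradient ascent on expected reward moves probability toward the strategy that actually works best.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over a 1-D array."""
    shifted = logits - np.max(logits)
    exp = np.exp(shifted)
    return exp / exp.sum()

# Hypothetical numbers: three strategies and their true success rates
true_reward = np.array([0.5, 0.7, 0.9])
# How often the training trajectories used each strategy (also hypothetical)
demo_policy = np.array([0.6, 0.3, 0.1])

# Perfect imitation reproduces the demo distribution, so it inherits its score
sft_value = float(demo_policy @ true_reward)

# Reward-driven training: exact policy-gradient ascent on expected reward,
# starting from the imitation policy (no sampling noise in this toy setup)
logits = np.log(demo_policy)
for _ in range(500):
    policy = softmax(logits)
    expected = policy @ true_reward
    # Gradient of expected reward w.r.t. softmax logits: p_k * (r_k - E[r])
    logits = logits + 1.0 * policy * (true_reward - expected)

rl_policy = softmax(logits)
rl_value = float(rl_policy @ true_reward)

print(f"Imitation (SFT) expected reward: {sft_value:.2f}")
print(f"After reward-driven updates:     {rl_value:.2f}")
```

The imitation policy scores 0.60 because it spends most of its probability on the weakest strategy, exactly as the demonstrations did. The reward-driven policy ends up concentrated on the third strategy.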

## Step 2: PPO in 5 Minutes

Quick crash course on PPO before we apply it to AFMs.

In [None]:
import numpy as np
import random

class SimplePPO:
    """PPO explained in the simplest terms"""
    
    def __init__(self):
        self.epsilon = 0.2  # Clipping parameter
        self.policy_old = None
        self.policy_new = None
    
    def explain_ppo(self):
        """Explain PPO conceptually"""
        
        print("🎯 PPO IN 5 MINUTES")
        print("="*40)
        
        print("\n1️⃣ THE GOAL:")
        print("   Improve policy (AFM) using rewards from environment")
        print("   Don't change too much at once (stability)")
        
        print("\n2️⃣ THE PROCESS:")
        print("   📊 Collect experience: (state, action, reward)")
        print("   🎯 Estimate advantages: how much better is this action?")
        print("   📈 Update policy: move toward better actions")
        print("   ✂️  Clip updates: don't change too drastically")
        
        print("\n3️⃣ THE MATH (simplified):")
        print("   ratio = new_policy(action) / old_policy(action)")
        print("   clipped_ratio = clip(ratio, 1-ε, 1+ε)")
        print("   objective = min(ratio * advantage, clipped_ratio * advantage)")
        print("   loss = -mean(objective)  # the objective is maximized")
        
        print("\n4️⃣ FOR AFM:")
        print("   State: Task description")
        print("   Action: Generated agent response")
        print("   Reward: How well did it solve the task?")
        
        print("\n✨ RESULT: AFM learns to generate better agent chains!")
    
    def simulate_update(self, advantages):
        """Simulate a PPO update step"""
        
        print("\n🔄 PPO UPDATE SIMULATION")
        print("-"*30)
        
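        # Simulate action probabilities under the old and new policies
        # (kept away from zero so the ratios stay finite and readable)
        old_probs = np.random.uniform(0.1, 0.9, 5)
        new_probs = old_probs + np.random.normal(0, 0.1, 5)
        new_probs = np.maximum(new_probs, 0.001)  # Avoid zeros
        
        ratios = new_probs / old_probs
        clipped_ratios = np.clip(ratios, 1 - self.epsilon, 1 + self.epsilon)
        
        # Clipped surrogate: the pessimistic (smaller) of the two terms
        objective = np.minimum(ratios * advantages, clipped_ratios * advantages)
        
        print("Example PPO update:")
        for i in range(5):
            print(f"  Action {i}: ratio={ratios[i]:.3f}, "
                  f"clipped={clipped_ratios[i]:.3f}, "
                  f"advantage={advantages[i]:.2f}, "
                  f"objective={objective[i]:.3f}")
        
        return clipped_ratios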

# Demonstrate PPO
ppo = SimplePPO()
ppo.explain_ppo()

# Simulate update
sample_advantages = np.array([0.5, -0.2, 0.8, -0.1, 0.3])
ppo.simulate_update(sample_advantages)
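
To see what the clipping actually does, it helps to evaluate the clipped objective on a few hand-picked cases. The following is a minimal sketch; the function name `ppo_clip_objective` and the four test points are chosen here for illustration only.

```python
import numpy as np

def ppo_clip_objective(ratio, advantage, epsilon=0.2):
    """Per-sample clipped surrogate objective (to be maximized)."""
    ratio = np.asarray(ratio, dtype=float)
    advantage = np.asarray(advantage, dtype=float)
    clipped = np.clip(ratio, 1 - epsilon, 1 + epsilon)
    # Taking the minimum makes the objective a pessimistic bound
    return np.minimum(ratio * advantage, clipped * advantage)

# Four cases: ratio far above / far below 1, with good and bad advantages
ratios = np.array([1.5, 0.5, 1.5, 0.5])
advantages = np.array([1.0, 1.0, -1.0, -1.0])

per_sample = ppo_clip_objective(ratios, advantages)
loss = -per_sample.mean()  # optimizers minimize, so negate the objective

for r, a, obj in zip(ratios, advantages, per_sample):
    print(f"ratio={r:.1f}, advantage={a:+.1f} -> objective={obj:+.2f}")
print(f"loss = {loss:.2f}")
# per_sample is [1.2, 0.5, -1.5, -0.8] and loss is 0.15
```

The first case is capped at 1.2: once a good action is already 20% more likely, there is no extra credit for pushing further. The third case is not capped at all: making a bad action more likely is penalized in full.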

## Step 3: Design Reward Function for AFM

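The reward function determines what the AFM learns to optimize. The version below scores outputs with simple keyword and length heuristics, which is enough for a demo; a real setup would check task success directly.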

In [None]:
class AFMRewardFunction:
    """Reward function for Agent Foundation Model training"""
    
    def __init__(self):
        self.weights = {
            'task_completion': 0.4,
            'agent_coherence': 0.2,
            'output_quality': 0.2,
            'efficiency': 0.1,
            'user_preference': 0.1
        }
    
    def evaluate_task_completion(self, task, output):
        """Did the AFM actually solve the task?"""
        # Simulate task completion check with keyword matching.
        # "incomplete" is tested first because it contains "complete".
        text = output.lower()
        if any(word in text for word in ("error", "failed", "incomplete")):
            return 0.0
        elif "complete" in text or "success" in text:
            return 1.0
        else:
            return 0.5
    
    def evaluate_agent_coherence(self, output):
        """Do the simulated agents work well together?"""
        # Check for agent markers
        agent_markers = ['[Planner]', '[Coder]', '[Reviewer]']
        present_agents = sum(1 for marker in agent_markers if marker in output)
        
        # Reward having all agents participate
        coherence_score = present_agents / len(agent_markers)
        
        # Bonus for logical flow
        if '[Planner]' in output and '[Coder]' in output:
            plan_pos = output.find('[Planner]')
            code_pos = output.find('[Coder]')
            if plan_pos < code_pos:  # Logical order
                coherence_score += 0.2
        
        return min(coherence_score, 1.0)
    
    def evaluate_output_quality(self, output):
        """Is the output high quality?"""
        quality_indicators = [
            'detailed', 'comprehensive', 'thorough',
            'optimized', 'robust', 'scalable'
        ]
        
        negative_indicators = [
            'unclear', 'incomplete', 'buggy',
            'broken', 'inefficient'
        ]
        
        quality_score = 0.5  # Baseline
        
        for indicator in quality_indicators:
            if indicator in output.lower():
                quality_score += 0.1
        
        for indicator in negative_indicators:
            if indicator in output.lower():
                quality_score -= 0.1
        
        return max(0, min(quality_score, 1.0))
    
    def evaluate_efficiency(self, output):
        """Is the solution efficient?"""
        # Longer isn't always better, but too short is bad
        length = len(output.split())
        
        if length < 20:
            return 0.2  # Too brief
        elif 20 <= length <= 100:
            return 1.0  # Just right
        elif 100 < length <= 200:
            return 0.8  # A bit verbose
        else:
            return 0.5  # Too verbose
    
    def evaluate_user_preference(self, output):
        """Would users prefer this response?"""
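        # Simulate user preference (in reality, this could be human feedback)
        positive_words = ['easy', 'clear', 'helpful', 'practical']
        score = 0.5
        
        # Match whole words so that "unclear" does not count as "clear"
        words = {w.strip('.,!?:;()[]"\'') for w in output.lower().split()}
        for word in positive_words:
            if word in words:
                score += 0.1
        
        return min(score, 1.0)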
    
    def calculate_reward(self, task, output):
        """Calculate total reward for AFM output"""
        
        components = {
            'task_completion': self.evaluate_task_completion(task, output),
            'agent_coherence': self.evaluate_agent_coherence(output),
            'output_quality': self.evaluate_output_quality(output),
            'efficiency': self.evaluate_efficiency(output),
            'user_preference': self.evaluate_user_preference(output)
        }
        
        # Weighted sum
        total_reward = sum(self.weights[k] * components[k] for k in components)
        
        return total_reward, components

# Test the reward function
reward_fn = AFMRewardFunction()

# Example AFM outputs
good_output = """[Planner]: Breaking down the task into clear steps...
[Coder]: Implementing a robust and scalable solution...
[Reviewer]: Code review complete. The solution is comprehensive and ready."""

bad_output = """Error: I don't understand this task. The solution is unclear and broken."""

# Calculate rewards
print("🏆 REWARD FUNCTION EVALUATION")
print("="*50)

for i, (name, output) in enumerate([("Good Output", good_output), ("Bad Output", bad_output)], 1):
    reward, components = reward_fn.calculate_reward("Build API", output)
    
    print(f"\n{i}. {name}:")
    print(f"   Total Reward: {reward:.3f}")
    
    for component, score in components.items():
        bar = '█' * int(score * 10)
        print(f"   {component:15}: {bar:10} {score:.2f}")
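
The weighted sum is easy to check by hand, and a little validation catches mistakes such as weights that no longer sum to 1 after tuning. The sketch below is a standalone helper, not part of `AFMRewardFunction`; the name `combine_rewards` and the component scores are illustrative assumptions.

```python
import math

# Same weights as the reward function in this step
WEIGHTS = {
    'task_completion': 0.4,
    'agent_coherence': 0.2,
    'output_quality': 0.2,
    'efficiency': 0.1,
    'user_preference': 0.1,
}

def combine_rewards(components, weights=WEIGHTS):
    """Weighted sum of component scores, with basic sanity checks."""
    if not math.isclose(sum(weights.values()), 1.0):
        raise ValueError("weights must sum to 1")
    if set(components) != set(weights):
        raise ValueError("components and weights must have the same keys")
    for name, score in components.items():
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"{name} score {score} is outside [0, 1]")
    return sum(weights[name] * components[name] for name in weights)

# Illustrative component scores for a strong multi-agent response
scores = {
    'task_completion': 1.0,
    'agent_coherence': 1.0,
    'output_quality': 0.8,
    'efficiency': 1.0,
    'user_preference': 0.6,
}

total = combine_rewards(scores)
# 0.4*1.0 + 0.2*1.0 + 0.2*0.8 + 0.1*1.0 + 0.1*0.6 = 0.92
print(f"Total reward: {total:.2f}")
```

Because the weights sum to 1 and every component lies in [0, 1], the total reward also stays in [0, 1], which keeps advantages on a predictable scale.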

## Step 4: PPO Training Loop for AFM

Put it all together: generate, evaluate, update.

In [None]:
class AFMPPO:
    """PPO training for Agent Foundation Model"""
    
    def __init__(self, afm_model, reward_function):
        self.afm = afm_model
        self.reward_fn = reward_function
        
        # PPO hyperparameters
        self.clip_epsilon = 0.2
        self.learning_rate = 3e-4
        self.ppo_epochs = 4
        self.batch_size = 64
        self.gamma = 0.99  # Discount factor
        
        # Training tracking
        self.episode_rewards = []
        self.performance_history = []
    
    def generate_response(self, task):
        """Generate AFM response (simulated)"""
        # In reality, this would be afm.generate(task)
        responses = [
            f"[Planner]: Analyzing '{task}' and creating detailed plan...\n[Coder]: Implementing solution with best practices...\n[Reviewer]: Review complete, solution approved.",
            f"[Planner]: Breaking down '{task}' into steps...\n[Coder]: Writing basic implementation...\n[Reviewer]: Looks okay.",
            f"Error: Could not process '{task}' properly. Solution incomplete.",
            f"[Planner]: Planning for '{task}'...\n[Coder]: def solution(): return 'done'\n[Reviewer]: Comprehensive solution ready."
        ]
        
        return random.choice(responses)
    
    def collect_trajectories(self, tasks, num_samples=100):
        """Collect experience trajectories"""
        trajectories = []
        
        print("📊 Collecting PPO trajectories...")
        
        for i in range(num_samples):
            task = random.choice(tasks)
            
            # Generate response
            response = self.generate_response(task)
            
            # Calculate reward
            reward, components = self.reward_fn.calculate_reward(task, response)
            
            # Store trajectory
            trajectories.append({
                'task': task,
                'response': response,
                'reward': reward,
                'components': components
            })
            
            if (i + 1) % 20 == 0:
                print(f"  Collected {i+1}/{num_samples} trajectories")
        
        avg_reward = np.mean([t['reward'] for t in trajectories])
        print(f"✅ Average reward: {avg_reward:.3f}")
        
        return trajectories
    
    def calculate_advantages(self, trajectories):
        """Calculate advantages for PPO update"""
        rewards = [t['reward'] for t in trajectories]
        
        # Simple advantage: reward minus the batch-mean baseline,
        # then scaled to unit variance for a stable update size
        rewards = np.array(rewards)
        advantages = (rewards - np.mean(rewards)) / (np.std(rewards) + 1e-8)
        
        return advantages
    
    def ppo_update(self, trajectories, advantages):
        """Perform PPO update (simplified)"""
        
        print("🔄 PPO Update Step")
        
        # A real implementation would take gradient steps on the clipped PPO
        # loss here. This demo only reports batch statistics; the policy
        # itself is never changed.
        
        positive_advantages = sum(1 for a in advantages if a > 0)
        non_positive_advantages = len(advantages) - positive_advantages
        
        print(f"   Positive advantages: {positive_advantages}/{len(advantages)}")
        print(f"   Non-positive advantages: {non_positive_advantages}/{len(advantages)}")
        print(f"   Policy update strength: {np.mean(np.abs(advantages)):.3f}")
        
        # Placeholder for model improvement: a random draw, not a measured gain
        improvement = np.random.uniform(0.01, 0.05)
        print(f"   Simulated improvement: +{improvement:.3f}")
        
        return improvement
    
    def train_epoch(self, tasks):
        """One epoch of PPO training"""
        
        # Collect trajectories
        trajectories = self.collect_trajectories(tasks, num_samples=50)
        
        # Calculate advantages
        advantages = self.calculate_advantages(trajectories)
        
        # PPO updates
        total_improvement = 0
        for epoch in range(self.ppo_epochs):
            improvement = self.ppo_update(trajectories, advantages)
            total_improvement += improvement
        
        # Track performance
        avg_reward = np.mean([t['reward'] for t in trajectories])
        self.episode_rewards.append(avg_reward)
        
        return avg_reward, total_improvement

# Initialize PPO training
print("🚀 INITIALIZING PPO TRAINING")
print("="*50)

# Mock AFM model
class MockAFM:
    def __init__(self):
        self.performance = 0.45  # Starting at 45%

afm_model = MockAFM()
ppo_trainer = AFMPPO(afm_model, reward_fn)

print(f"AFM initialized with {afm_model.performance:.1%} baseline performance")
print(f"PPO config: lr={ppo_trainer.learning_rate}, clip_ε={ppo_trainer.clip_epsilon}")
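
For comparison with the simulated update above, here is a minimal sketch of what a real PPO step looks like on a tiny problem. It assumes a softmax policy over four canned responses with fixed, hypothetical rewards, and it writes out the gradient of the clipped objective by hand in NumPy. The function `ppo_toy` and all of its parameters are illustrative, not the training code used for the AFM.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def ppo_toy(rewards, iterations=30, batch=64, epochs=4, lr=0.5, eps=0.2, seed=0):
    """Run clipped-PPO updates on a softmax policy over a few fixed responses."""
    rng = np.random.default_rng(seed)
    rewards = np.asarray(rewards, dtype=float)
    n = len(rewards)
    logits = np.zeros(n)  # start from a uniform policy

    for _ in range(iterations):
        # 1. Collect a batch with the current (old) policy
        old_p = softmax(logits)
        actions = rng.choice(n, size=batch, p=old_p)
        r = rewards[actions]

        # 2. Advantages: batch-mean baseline, normalized
        adv = (r - r.mean()) / (r.std() + 1e-8)
        onehot = np.eye(n)[actions]

        # 3. Several epochs of gradient ascent on the clipped objective
        for _ in range(epochs):
            p = softmax(logits)
            ratio = p[actions] / old_p[actions]
            # Gradient flows only where the unclipped term is the minimum
            active = np.where(adv >= 0, ratio < 1 + eps, ratio > 1 - eps)
            weight = active * adv * ratio
            # d(ratio)/d(logits) = ratio * (onehot - p)
            grad = (weight[:, None] * (onehot - p)).mean(axis=0)
            logits = logits + lr * grad

    return softmax(logits)

# Hypothetical rewards for four canned responses
rewards = np.array([0.9, 0.6, 0.1, 0.8])
final_probs = ppo_toy(rewards)

print("Final response probabilities:", np.round(final_probs, 3))
print(f"Expected reward: 0.600 (uniform) -> {final_probs @ rewards:.3f}")
```

Unlike the simulated trainer, the sampling distribution here changes after every update, so later batches contain more of the high-reward responses. That feedback loop is what the mock `generate_response` lacks.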

## Step 5: Run PPO Training - Watch the Magic!

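Run the training loop and watch the reported performance climb. Keep in mind that this is a simulation: `generate_response` picks a canned response at random and `ppo_update` draws a random improvement, so the curve illustrates the workflow rather than real learning.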

In [None]:
# Training tasks
training_tasks = [
    "Build REST API for user management",
    "Implement authentication system",
    "Create database schema",
    "Optimize query performance",
    "Add error handling",
    "Write unit tests",
    "Deploy to production",
    "Monitor system health"
]

def run_ppo_training(trainer, tasks, epochs=5):
    """Run complete PPO training"""
    
    print("🎯 STARTING PPO TRAINING")
    print("="*60)
    print(f"Training epochs: {epochs}")
    print(f"Tasks: {len(tasks)}")
    print(f"Target: 55.3% GAIA performance\n")
    
    performance_history = []
    baseline_performance = 0.45  # 45% baseline
    current_performance = baseline_performance
    
    for epoch in range(epochs):
        print(f"\n📍 EPOCH {epoch + 1}/{epochs}")
        print("-" * 40)
        
        # Train one epoch
        avg_reward, improvement = trainer.train_epoch(tasks)
        
        # Update performance (simplified)
        current_performance += improvement * 0.1  # Scale improvement
        current_performance = min(current_performance, 0.60)  # Cap at 60%
        
        performance_history.append(current_performance)
        
        # Show progress
        improvement_pct = (current_performance - baseline_performance) / baseline_performance * 100
        
        print(f"\n📊 Epoch {epoch + 1} Results:")
        print(f"   Average reward: {avg_reward:.3f}")
        print(f"   Performance: {current_performance:.1%}")
        print(f"   Improvement: +{improvement_pct:.1f}% from baseline")
        
        # Progress bar
        progress = int((epoch + 1) / epochs * 20)
        bar = '█' * progress + '░' * (20 - progress)
        print(f"   Progress: {bar} {((epoch + 1) / epochs * 100):.0f}%")
    
    # Final results
    print("\n" + "="*60)
    print("🏆 PPO TRAINING COMPLETE!")
    print(f"\n📈 Performance Journey:")
    print(f"   Baseline (SFT): {baseline_performance:.1%}")
    print(f"   Final (PPO): {current_performance:.1%}")
    
    relative_improvement = (current_performance - baseline_performance) / baseline_performance * 100
    print(f"   Relative gain: +{relative_improvement:.1f}%")
    
    # Compare to CoA paper results
    print(f"\n🎯 CoA Paper Comparison:")
    print(f"   Paper result: 55.3% GAIA")
    print(f"   Our result: {current_performance:.1%}")
    
    if current_performance >= 0.553:
        print("   Status: ✅ TARGET ACHIEVED!")
    elif current_performance >= 0.52:
        print("   Status: 🟡 Close to target")
    else:
        print("   Status: 🔴 Needs more training")
    
    return performance_history

# Run the training!
results = run_ppo_training(ppo_trainer, training_tasks, epochs=5)
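
Since the improvement in each update is a uniform random draw, the typical outcome of this simulation can be worked out in advance. The sketch below uses the constants from the code above (5 epochs, 4 PPO updates per epoch, draws between 0.01 and 0.05, a 0.1 scaling factor); the function name is hypothetical.

```python
def simulated_final_performance(baseline=0.45, epochs=5, ppo_epochs=4,
                                low=0.01, high=0.05, scale=0.1, cap=0.60):
    """Expected and best-case final performance of the simulated loop."""
    updates = epochs * ppo_epochs
    mean_draw = (low + high) / 2  # mean of a uniform draw
    expected = baseline + updates * mean_draw * scale
    best_case = baseline + updates * high * scale
    return min(expected, cap), min(best_case, cap)

expected, best_case = simulated_final_performance()
# expected:  0.45 + 20 * 0.03 * 0.1 = 0.51
# best case: 0.45 + 20 * 0.05 * 0.1 = 0.55
print(f"Expected final performance: {expected:.1%}")
print(f"Best-case final performance: {best_case:.1%}")
```

A typical run therefore ends near 51%, and even the luckiest run tops out at 55%, just short of the 55.3% target. Seeing the "Needs more training" status here is the expected result of the simulation settings, not a sign that something went wrong.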

## Step 6: Visualize PPO Learning Curve

See how performance improves over training.

In [None]:
def visualize_learning_curve(performance_history):
    """ASCII art learning curve"""
    
    print("\n📈 PPO LEARNING CURVE")
    print("="*50)
    
    # Report the observed performance range
    min_perf = min(performance_history)
    max_perf = max(performance_history)
    
    print(f"Performance Range: {min_perf:.1%} → {max_perf:.1%}\n")
    
    # Create ASCII graph
    for epoch, perf in enumerate(performance_history):
        # Scale to 0-30 characters
        bar_length = int((perf - 0.4) / (0.6 - 0.4) * 30)
        bar = '█' * max(0, bar_length)
        
        print(f"Epoch {epoch+1}: {bar:30} {perf:.1%}")
    
    # Add target line
    target_length = int((0.553 - 0.4) / (0.6 - 0.4) * 30)
    target_bar = '─' * target_length + '🎯'
    print(f"Target:  {target_bar:30} 55.3%")
    
    # Show key insights (measured from the first recorded epoch, not the SFT baseline)
    improvement = performance_history[-1] - performance_history[0]
    relative_improvement = improvement / performance_history[0] * 100
    intervals = max(len(performance_history) - 1, 1)
    
    print(f"\n📊 Key Insights:")
    print(f"   Gain since epoch 1: +{improvement * 100:.1f} percentage points")
    print(f"   Relative gain since epoch 1: +{relative_improvement:.1f}%")
    print(f"   Average gain per epoch: {improvement / intervals * 100:.2f} percentage points")

visualize_learning_curve(results)

## Step 7: Compare All Approaches

Traditional → SFT → PPO performance comparison.

In [None]:
def compare_all_approaches():
    """Compare Traditional vs SFT vs PPO"""
    
    approaches = [
        {
            'name': 'Traditional Multi-Agent',
            'performance': 0.42,
            'cost': 3.0,  # 3x API calls
            'latency': 3.0,  # 3x latency
            'complexity': 'High',
            'pros': ['Specialized agents', 'Interpretable'],
            'cons': ['Slow', 'Expensive', 'Complex orchestration']
        },
        {
            'name': 'AFM with SFT',
            'performance': 0.45,
            'cost': 1.0,  # 1x API call
            'latency': 1.0,  # 1x latency
            'complexity': 'Medium',
            'pros': ['Fast', 'Cheap', 'Single model'],
            'cons': ['Limited by training data', 'No exploration']
        },
        {
            'name': 'AFM with PPO',
            'performance': 0.553,
            'cost': 1.0,  # 1x API call
            'latency': 1.0,  # 1x latency
            'complexity': 'Medium',
            'pros': ['Best performance', 'Fast', 'Cheap', 'Self-improving'],
            'cons': ['Complex training', 'Reward engineering']
        }
    ]
    
    print("🏁 COMPLETE COMPARISON")
    print("="*80)
    
    # Performance comparison (name column is 24 wide to fit the longest name)
    print("\n📊 Performance (GAIA Benchmark):")
    for approach in approaches:
        bar_length = int(approach['performance'] * 50)
        bar = '█' * bar_length
        print(f"  {approach['name']:24} {bar:30} {approach['performance']:.1%}")
    
    # Cost comparison
    print("\n💰 Cost (relative to PPO):")
    for approach in approaches:
        bar_length = int(approach['cost'] * 10)
        bar = '█' * bar_length
        print(f"  {approach['name']:24} {bar:30} {approach['cost']:.1f}x")
    
    # Speed comparison
    print("\n⚡ Speed (relative to PPO):")
    for approach in approaches:
        # Invert for speed (lower latency = higher speed)
        speed = 1 / approach['latency']
        bar_length = int(speed * 10)
        bar = '█' * bar_length
        print(f"  {approach['name']:24} {bar:10} {speed:.1f}x")
    
    # Detailed breakdown
    print("\n📋 Detailed Comparison:")
    print("-" * 80)
    
    for approach in approaches:
        print(f"\n{approach['name']}:")
        print(f"  Performance: {approach['performance']:.1%}")
        print(f"  Cost: {approach['cost']:.1f}x")
        print(f"  Complexity: {approach['complexity']}")
        print(f"  Pros: {', '.join(approach['pros'])}")
        print(f"  Cons: {', '.join(approach['cons'])}")
    
    # Final verdict
    print("\n🏆 WINNER: AFM with PPO")
    print("   🎯 Best performance: 55.3% vs 42% traditional")
    print("   ⚡ 3x faster than traditional")
    print("   💰 3x cheaper than traditional")
    print("   🚀 Continues to improve with more training")

compare_all_approaches()

## Exercise 1: Custom PPO Implementation 🛠️

Build your own PPO algorithm for AFM.

In [None]:
class YourPPO:
    """Your custom PPO implementation"""
    
    def __init__(self, clip_epsilon=0.2):
        self.clip_epsilon = clip_epsilon
        # TODO: Add your parameters
    
    def compute_advantages(self, rewards, values=None):
        """Compute advantage estimates"""
        # TODO: Implement GAE (Generalized Advantage Estimation)
        # or simpler advantage calculation
        advantages = np.array(rewards)  # Placeholder
        return advantages
    
    def ppo_loss(self, old_probs, new_probs, advantages):
        """Compute PPO clipped loss"""
        # TODO: Implement the actual PPO loss
        # objective = min(r_t * A_t, clip(r_t, 1-ε, 1+ε) * A_t), with r_t = new_probs / old_probs
        # loss = -mean(objective), because optimizers minimize
        loss = 0.0  # Placeholder
        return loss
    
    def train_step(self, trajectories):
        """One PPO training step"""
        # TODO: Implement complete training step
        pass

# Test your implementation
your_ppo = YourPPO()
test_rewards = [0.5, 0.7, 0.3, 0.8, 0.6]
advantages = your_ppo.compute_advantages(test_rewards)
print(f"Your advantages: {advantages}")
print("Complete the implementation above!")
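
As a starting point for `compute_advantages`, here is a minimal sketch of Generalized Advantage Estimation. It assumes a multi-step episode with one reward per step and a value estimate for every state plus one bootstrap value at the end; the function name `gae` and the default `gamma` and `lam` values are illustrative choices. For single-step episodes, like the one-response tasks in this part, GAE reduces to reward minus value.

```python
import numpy as np

def gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation.

    rewards: length-T sequence of per-step rewards
    values:  length-(T+1) sequence of value estimates; the last entry is
             the bootstrap value of the state after the final step
    """
    rewards = np.asarray(rewards, dtype=float)
    values = np.asarray(values, dtype=float)
    T = len(rewards)
    advantages = np.zeros(T)
    running = 0.0
    # Work backwards so each step can reuse the advantage of the next one
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD error
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages

# With gamma = lam = 1 and zero values, GAE is just the reward-to-go
print(gae([1.0, 0.0, 2.0], [0.0, 0.0, 0.0, 0.0], gamma=1.0, lam=1.0))
# With lam = 0, GAE is the one-step TD error
print(gae([1.0, 0.0, 2.0], [0.5, 0.5, 0.5, 0.0], gamma=1.0, lam=0.0))
```

The `lam` parameter trades bias for variance: values near 0 trust the value estimates, values near 1 trust the observed rewards.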

## Exercise 2: Advanced Reward Engineering 🎯

Design a reward function that gets 60%+ performance.

In [None]:
class AdvancedRewardFunction:
    """Your advanced reward function for AFM"""
    
    def __init__(self):
        # TODO: Design your reward components
        pass
    
    def calculate_reward(self, task, afm_output, ground_truth=None):
        """Calculate advanced reward"""
        # TODO: Implement sophisticated reward calculation
        # Ideas:
        # - Semantic similarity to ground truth
        # - Code execution success (for coding tasks)
        # - User preference modeling
        # - Multi-objective optimization
        # - Uncertainty-aware rewards
        
        reward = 0.5  # Placeholder
        return reward

# Benchmark your reward function
advanced_reward = AdvancedRewardFunction()

test_cases = [
    ("Build API", "[Planner]: Plan ready [Coder]: Code complete [Reviewer]: Approved"),
    ("Debug error", "Error: Cannot process this task"),
    ("Optimize query", "[Planner]: Analysis done [Coder]: Query optimized [Reviewer]: 10x faster")
]

print("Testing your advanced reward function:")
for task, output in test_cases:
    reward = advanced_reward.calculate_reward(task, output)
    print(f"  Task: {task[:15]}... → Reward: {reward:.3f}")

print("\nGoal: Design rewards that guide AFM to 60%+ performance!")

## Exercise 3: Hyperparameter Optimization 🔬

Find the best PPO hyperparameters for AFM training.

In [None]:
def hyperparameter_search(search_space, num_trials=10):
    """Search for optimal PPO hyperparameters"""
    
    # TODO: Implement hyperparameter optimization
    # Techniques to try:
    # 1. Grid search
    # 2. Random search
    # 3. Bayesian optimization
    # 4. Population-based training
    
    results = []
    
    print("🔬 HYPERPARAMETER SEARCH")
    print("="*40)
    
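    for trial in range(num_trials):
        # Sample hyperparameters from the search space passed in:
        # two-element entries are (low, high) ranges, the others are choices
        config = {
            'learning_rate': random.uniform(*search_space['learning_rate']),
            'clip_epsilon': random.uniform(*search_space['clip_epsilon']),
            'batch_size': random.choice(search_space['batch_size']),
            'ppo_epochs': random.choice(search_space['ppo_epochs'])
        }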
        
        # TODO: Train with these hyperparameters
        performance = random.uniform(0.45, 0.60)  # Simulate
        
        results.append((config, performance))
        print(f"Trial {trial+1}: {performance:.1%} performance")
    
    # Find best configuration
    best_config, best_performance = max(results, key=lambda x: x[1])
    
    print(f"\n🏆 Best configuration:")
    for key, value in best_config.items():
        print(f"  {key}: {value}")
    print(f"Performance: {best_performance:.1%}")
    
    return best_config, best_performance

# Define search space
search_space = {
    'learning_rate': [1e-5, 1e-3],
    'clip_epsilon': [0.1, 0.3],
    'batch_size': [32, 64, 128],
    'ppo_epochs': [2, 4, 8]
}

best_config, best_perf = hyperparameter_search(search_space, num_trials=5)
print(f"\nChallenge: Can you get above 57%? Current best: {best_perf:.1%}")
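
One practical detail for random search: a learning-rate range such as 1e-5 to 1e-3 spans two orders of magnitude, and sampling it uniformly puts about 91% of trials above 1e-4. A common alternative is to sample uniformly in log space. The helper below is a small sketch; `sample_log_uniform` is a name introduced here.

```python
import math
import random

def sample_log_uniform(low, high, rng=random):
    """Sample so that every order of magnitude in [low, high] is equally likely."""
    return 10 ** rng.uniform(math.log10(low), math.log10(high))

rng = random.Random(0)  # seeded for reproducibility
samples = [sample_log_uniform(1e-5, 1e-3, rng) for _ in range(1000)]

# Fraction of samples in the lower decade (1e-5 to 1e-4)
share_small = sum(s < 1e-4 for s in samples) / len(samples)
print(f"Share of samples below 1e-4: {share_small:.0%}")
```

About half of the samples land in each decade, so small learning rates get a fair share of the trial budget.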

## Key Takeaways 🎓

1. **PPO Breakthrough**: an 18-20% relative improvement over SFT alone
2. **Reward Design**: Critical for guiding AFM learning
3. **Exploration**: PPO discovers strategies beyond training data
4. **Stability**: Clipping prevents catastrophic policy updates
5. **SOTA Results**: 55.3% on GAIA, ahead of the 42% traditional multi-agent baseline used in this part

## The Complete CoA Pipeline 🔄

We've now built the complete Chain-of-Agents system:

1. **Part 1**: Multi-agent trajectories → Training data
2. **Part 2**: Progressive filtering → High-quality data
3. **Part 3**: SFT → AFM that mimics agents
4. **Part 4**: PPO → AFM that beats agents!

## Why This Is Revolutionary 🌟

**Traditional AI**: Hand-coded agents, complex orchestration

**Chain-of-Agents**: Learn agent behaviors, optimize automatically

- **Performance**: State-of-the-art results
- **Efficiency**: about 3x faster and 3x cheaper than a three-agent pipeline (one model call instead of three)
- **Scalability**: One model handles all agents
- **Adaptability**: Continues improving with more data

## What's Next? 🚀

Build your own CoA system:
1. Collect domain-specific agent trajectories
2. Design custom reward functions
3. Scale to production workloads
4. Beat the 55.3% GAIA benchmark!

## Final Challenge 🏆

Implement the complete CoA pipeline and achieve:
- **55%+ GAIA performance**
- **< 2 second inference time**
- **< $0.01 per query cost**

## Homework 📝

1. Implement real PPO with PyTorch
2. Design task-specific reward functions
3. Train on 10,000+ trajectories
4. Benchmark against GPT-4 + traditional agents
5. Deploy as production API
6. Read the complete CoA paper from OPPO

**Congratulations!** You've built Chain-of-Agents from scratch! 🎉

You now understand why CoA represents the future of multi-agent AI systems.