<div align="center">
  <img src="../assets/images/hackathon.png" alt="Holistic AI Hackathon Logo" width="600"/>
</div>

**Event**: [hackathon.holisticai.com](https://hackathon.holisticai.com)

---


# Tutorial 7: Reinforcement Learning for Agents

**Learn advanced RL concepts for agent training (Advanced/Optional)**

> ⚠️ **Important**: This is an **advanced, optional tutorial** focused on **conceptual learning**.  
> **Practical RL training requires GPU resources and hours of training time**, which may not be feasible during a 48-hour hackathon.  
> This tutorial teaches **concepts and methods** rather than requiring full implementation.

## What You'll Learn

1. **Understand** why RL can improve agent performance
2. **Learn** the concepts behind RL training (trajectories, rewards, GRPO)
3. **See** how RULER automates reward functions
4. **Explore** training patterns (without full training setup)

## Why This Tutorial?

- **Educational**: Understand advanced agent training techniques
- **Conceptual**: Learn RL concepts without GPU requirements
- **Reference**: See how production RL training works
- **Optional**: Not required for hackathon projects

---

## Prerequisites

- Basic Python knowledge
- Recommended: Completed tutorials 01-03
- Time: ~20 minutes (conceptual learning)
- **Note**: Full RL training requires GPU (8GB+ VRAM) and hours of training time
- **Holistic AI Bedrock API** (optional, for examples) - Credentials will be provided during the hackathon event

**API Guide**: [../assets/api-guide.pdf](../assets/api-guide.pdf)


## What is Reinforcement Learning? (Simple Explanation)

**Think of training an AI agent like training a dog:**

🐕 **Traditional AI (Prompting)**: You give the dog a command, and it does its best based on what it already knows. It's like asking a well-trained dog to "sit" - it knows the command, but might not be perfect.

🎓 **Reinforcement Learning**: You let the dog try different things, reward it when it does well, and it learns from experience. Over time, the dog gets better and better at the task.

### Real-World Analogy

Imagine teaching a child to play chess:

1. **Traditional Prompting**: You explain the rules once, and the child plays based on that explanation.
   - ✅ Fast to start
   - ❌ Limited by initial knowledge
   - ❌ Can't improve through practice

2. **Reinforcement Learning**: The child plays many games, learns from wins and losses, and gets better over time.
   - ✅ Learns from experience
   - ✅ Gets better with practice
   - ✅ Can discover new strategies
   - ⏱️ Takes time to train

### Key Concepts (In Simple Terms)

**Agent**: The AI that learns (like the chess player)

**Environment**: The task or problem (like the chess board)

**Action**: What the agent does (like making a chess move)

**Reward**: Feedback on how well the agent did (like winning or losing)

**Training**: The process of learning from many attempts
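These five terms can be seen working together in a very small example. The sketch below is a toy epsilon-greedy agent on a three-action "bandit" environment, written in plain Python; the function name `run_bandit` and the reward probabilities are illustrative assumptions, and this is far simpler than the LLM training discussed later.

```python
import random

def run_bandit(reward_probs, steps=3000, epsilon=0.1, seed=0):
    """Epsilon-greedy agent on a multi-armed bandit (a one-state environment)."""
    rng = random.Random(seed)
    n_actions = len(reward_probs)
    counts = [0] * n_actions        # how often each action was tried
    estimates = [0.0] * n_actions   # running average reward per action

    for _ in range(steps):
        # Agent: explore with probability epsilon, otherwise take the best-known action
        if rng.random() < epsilon:
            action = rng.randrange(n_actions)
        else:
            action = max(range(n_actions), key=lambda a: estimates[a])

        # Environment: pays reward 1.0 with the action's hidden probability
        reward = 1.0 if rng.random() < reward_probs[action] else 0.0

        # Training: update the running average for the chosen action
        counts[action] += 1
        estimates[action] += (reward - estimates[action]) / counts[action]

    return estimates, counts

estimates, counts = run_bandit([0.2, 0.5, 0.8])
best_action = max(range(len(estimates)), key=lambda a: estimates[a])
print("Estimated rewards:", [round(e, 2) for e in estimates])
print("Best action:", best_action)
```

The agent is never told which action is best; it infers that from rewards alone, which is the same idea RL training applies to agent responses.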

### Why Use RL for AI Agents?

**Regular AI (Prompting)**:
- Like reading a manual once
- Works for common tasks
- Can't improve without new instructions

**RL-Trained AI**:
- Like practicing a skill
- Gets better with experience
- Can handle complex, multi-step tasks
- Learns optimal strategies


import os
from pathlib import Path
from dotenv import load_dotenv

# Load from .env file in parent directory
env_path = Path('../.env')
if env_path.exists():
    load_dotenv(env_path)
    print("üìÑ Loaded configuration from .env file")
else:
    print("‚ö†Ô∏è  No .env file found - using environment variables")

# Verify API keys
print("\nüîë API Key Status:")
if os.getenv('HOLISTIC_AI_TEAM_ID') and os.getenv('HOLISTIC_AI_API_TOKEN'):
    print("  ‚úÖ Holistic AI Bedrock credentials loaded")
elif os.getenv('OPENAI_API_KEY'):
    print("  ‚ö†Ô∏è  OpenAI API key loaded")
else:
    print("  ‚ö†Ô∏è  No API keys found")

print("\nüìÅ Working directory:", Path.cwd())

# Import Holistic AI Bedrock helper
import sys
try:
    sys.path.insert(0, '../core')
    from react_agent.holistic_ai_bedrock import get_chat_model
    print("\n✅ Holistic AI Bedrock helper loaded")
except ImportError:
    print("\n⚠️  Could not import from core - will use OpenAI only")

# Import official packages
from langgraph.prebuilt import create_react_agent
from langchain_core.messages import HumanMessage
import numpy as np

print("\n✅ All imports successful!")


## ⚠️ Important: Resource Requirements

**This tutorial focuses on concepts, not full training.**

### Full RL Training Requirements (Not Required Here)

If you wanted to do **actual RL training**, you would need:

- **GPU**: 8GB+ VRAM (e.g., RTX 3060) - **Most participants won't have this**
- **Time**: 2-10 hours of training time - **Not feasible in 48-hour hackathon**
- **Setup**: ART backend, model registration - **Complex configuration**

### What This Tutorial Provides Instead

✅ **Conceptual understanding** - Learn how RL works  
✅ **Code patterns** - See training structure  
✅ **Evaluation methods** - Understand how to measure improvements  
✅ **Quick examples** - Simple demonstrations without GPU  

**You can learn RL concepts without actually training!**

---

## Step 0: Setup (Conceptual Only)

This tutorial focuses on concepts. We'll use simple examples that don't require GPU or full training setup.

In [None]:
# Simple setup - no GPU required for conceptual learning
import json
from typing import List, Dict

print("✅ Setup complete!")
print("\n📚 This tutorial focuses on concepts.")
print("   No GPU or full training setup required.")
print("   We'll learn RL patterns through examples.")



## Step 1: Understanding RL for Agents

### The Two Ways to Make AI Work

Think of AI agents like employees. There are two ways to get them to do work:

#### Method 1: Prompting (What We've Been Doing)

**Like giving instructions to a new employee:**

```python
# You tell the AI what to do
llm = get_chat_model('claude-3-5-sonnet')  # Uses Holistic AI Bedrock
# `search` stands for any tool you have already defined
agent = create_react_agent(llm, tools=[search])
result = agent.invoke(
    {"messages": [HumanMessage(content="Find info about quantum computing")]}
)
```

**How it works:**
- You write clear instructions (prompts)
- The AI follows those instructions
- It uses its general knowledge
- Works immediately, no training needed

**Pros:**
- ✅ Fast to set up (minutes)
- ✅ No training needed
- ✅ Works for many tasks
- ✅ Easy to change instructions

**Cons:**
- ❌ Limited by the AI's general knowledge
- ❌ May fail on complex, specific tasks
- ❌ Can't improve without new prompts
- ❌ Not specialized for your specific needs

**Real Example**: Like asking a general-purpose assistant to help with your specific workflow. They understand, but might not be perfect.

#### Method 2: RL Training (What We'll Learn)

**Like training an employee through practice:**

```python
# You let the AI practice and learn
trained_agent = art.TrainableModel(
    name='my-agent',
    project='my-project',
    base_model='unsloth/Llama-3.2-3B-Instruct',
)
await trained_agent.train(trajectory_groups)  # Learn from experience
```

**How it works:**
- The AI tries many different approaches
- It gets feedback on what works (rewards)
- It learns from successes and failures
- It gets better over time

**Pros:**
- ✅ Specialized for your specific task
- ✅ More reliable and consistent
- ✅ Better at complex, multi-step tasks
- ✅ Can discover optimal strategies
- ✅ Improves with more training

**Cons:**
- ⏱️ Takes time to train (hours)
- 💻 Needs GPU resources
- 📊 Requires training data/scenarios
- 🔧 More complex setup

**Real Example**: Like training a specialist who practices your specific workflow until they become an expert.

### Visual Comparison

```
Traditional Prompting:
You → Instructions → AI → Response
(One-time, based on general knowledge)

RL Training:
You → Scenarios → AI tries → Gets feedback → Learns → Tries again → Better!
(Repeated practice, learns from experience)
```

### When Should You Use Each Method?

**Use Traditional Prompting When:**
- ✅ Task is simple and straightforward
- ✅ You need results quickly
- ✅ Task changes frequently
- ✅ A moderate success rate (roughly 70-80%) is acceptable
- ✅ You don't have GPU resources

**Use RL Training When:**
- ✅ Task requires very high reliability (roughly 90%+ success rate)
- ✅ Complex, multi-step interactions
- ✅ Task is repetitive and well-defined
- ✅ You have training scenarios or can generate them
- ✅ You need specialized performance
- ✅ You have GPU resources available

### Real-World Examples

**Traditional Prompting Works Well For:**
- Answering general questions
- Simple one-step tasks
- Creative writing
- Quick prototypes

**RL Training Works Better For:**
- Email management agents (learn your specific workflow)
- Game-playing agents (learn strategies)
- Customer support (learn to handle complex cases)
- Research agents (learn to use tools effectively)

### The Bottom Line

**Prompting** = Fast, general-purpose, good enough for many tasks

**RL Training** = Slower, specialized, excellent for specific tasks

**In this tutorial**, you'll learn the concepts behind RL training, which is used to create highly specialized AI agents that excel at specific tasks! 🎯

## Step 2: ART Architecture Overview

### What is ART? (Simple Explanation)

**ART (Agent Reinforcement Trainer)** is like a training gym for AI agents:

- 🏋️ **The Gym (ART Backend)**: Where the actual training happens
- 👤 **The Trainer (Your Code)**: You define what the agent should learn
- 🎯 **The Student (AI Agent)**: The AI that gets trained
- 🏆 **The Scoreboard (RULER)**: Automatically judges how well the agent did

### The Training Process (Step by Step)

Think of it like training for a sport:

1. **You provide scenarios** (like practice drills)
   - "Answer this question"
   - "Solve this problem"
   - "Complete this task"

2. **The agent tries multiple times** (like practicing a move)
   - Each attempt is called a "rollout"
   - Agent tries different approaches
   - Creates multiple solutions

3. **RULER judges the attempts** (like a coach scoring performance)
   - Compares all attempts
   - Ranks them from best to worst
   - Assigns scores automatically

4. **The agent learns** (like improving through practice)
   - Sees what worked well
   - Adjusts its strategy
   - Gets better over time

5. **Repeat** (like training sessions)
   - More scenarios = More practice
   - More training steps = More improvement

### ART Architecture (Technical View)

ART (Agent Reinforcement Trainer) has a client-server architecture:

```
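+---------------------------------------------+
|                Your Codebase                |
|  +---------------------------------------+  |
|  | ART Client (Frontend)                 |  |
|  | - Minimal dependencies                |  |
|  | - Wraps your agent logic              |  |
|  | - Collects trajectories               |  |
|  +---------------------------------------+  |
+---------------------------------------------+
                       |
                       | API
                       v
+---------------------------------------------+
|        ART Backend (Training Server)        |
|  +---------------------------------------+  |
|  | - Unsloth GRPO Trainer                |  |
|  | - GPU acceleration                    |  |
|  | - Model serving (OpenAI-compatible)   |  |
|  | - RULER reward function               |  |
|  +---------------------------------------+  |
+---------------------------------------------+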
```

### Key Components Explained Simply

1. **Trajectory** - A record of what the agent did
   - Like a game replay showing all moves
   - Contains: question, agent's response, reward score

2. **Rollout Function** - How your agent tries the task
   - Like a practice attempt
   - Agent generates a response
   - Records what happened

3. **Reward Function (RULER)** - Scores how well the agent did
   - Like a judge giving scores
   - Compares multiple attempts
   - Automatically assigns scores (0.0 to 1.0)

4. **Training Loop** - The learning process
   - Collects many trajectories
   - Scores them with RULER
   - Updates the agent's "brain" (model weights)
   - Repeats to improve
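The first three components can be sketched in plain Python. This is a toy illustration under stated assumptions, not the ART API: `Trajectory`, `toy_policy`, `rollout`, and `score` below are hypothetical stand-ins, the "model" just picks an answer at random, and the reward is a manual exact-match check rather than RULER.

```python
import random
from dataclasses import dataclass, field

@dataclass
class Trajectory:
    """Record of one attempt: the conversation plus its reward."""
    messages: list = field(default_factory=list)
    reward: float = 0.0

def toy_policy(question: str, rng: random.Random) -> str:
    """Stand-in for a model: picks one of three candidate answers at random."""
    return rng.choice(["4", "5", "22"])

def rollout(question: str, rng: random.Random) -> Trajectory:
    """One practice attempt: ask the policy and record what happened."""
    answer = toy_policy(question, rng)
    return Trajectory(messages=[
        {"role": "user", "content": question},
        {"role": "assistant", "content": answer},
    ])

def score(trajectory: Trajectory, expected: str) -> float:
    """Manual reward: 1.0 for the expected answer, otherwise 0.0."""
    return 1.0 if trajectory.messages[-1]["content"] == expected else 0.0

rng = random.Random(0)
group = [rollout("What is 2 + 2?", rng) for _ in range(8)]  # 8 attempts, one scenario
for t in group:
    t.reward = score(t, expected="4")

print([t.reward for t in group])
```

A real training loop would then pass the scored group to the trainer, which shifts the model toward the higher-reward attempts.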

### Why This Architecture?

**Separation of Concerns:**
- Your code focuses on defining the task
- ART backend handles the complex training
- You don't need to understand all the math!

**Scalability:**
- Backend can run on powerful GPUs
- Your code can run anywhere
- Easy to scale up training

**Flexibility:**
- Works with different models
- Supports various tasks
- Easy to experiment

### Key Components

1. **Trajectory** - Record of agent's actions and environment responses
2. **Rollout Function** - How your agent interacts with the environment
3. **Reward Function** - Scores how well the agent did (RULER automates this!)
4. **Training Loop** - GRPO updates the model based on rewards
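GRPO (Group Relative Policy Optimization) compares the rollouts within one group instead of judging each in isolation. A minimal sketch of that core idea, assuming the common formulation where each reward is normalized by the group's mean and standard deviation, is shown below; the function name is illustrative, and a real trainer adds further pieces (such as clipping) that are omitted here.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Rewards for four rollouts of the same scenario
rewards = [0.2, 0.9, 0.5, 0.4]
advantages = group_relative_advantages(rewards)
for r, a in zip(rewards, advantages):
    print(f"reward={r:.1f}  advantage={a:+.2f}")
```

Rollouts that beat the group average get a positive advantage and are reinforced; those below it get a negative one. If every rollout in a group receives the same reward, all advantages are zero and the group carries no learning signal, which is one reason diverse rollouts matter.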

## Step 2.5: Supported Models for RL Training

ART + Unsloth support many models for RL training. Here are the main options:

### Small Models (Recommended for Limited Resources) 🚀

**Best for: Low VRAM (4-8GB), Fast Training, Quick Iteration**

- **Llama 3.2 1B** - `unsloth/Llama-3.2-1B-Instruct` (1B params, ~4GB VRAM)
- **Llama 3.2 3B** - `unsloth/Llama-3.2-3B-Instruct` (3B params, ~6GB VRAM)
- **Qwen2.5 Coder 1.5B** - `unsloth/Qwen2.5-Coder-1.5B-Instruct` (1.5B params, ~4GB VRAM)
- **Qwen3 4B** - `unsloth/Qwen3-4B-Instruct` (4B params, ~8GB VRAM)
- **Phi-3.5 Mini** - `unsloth/Phi-3.5-Mini-Instruct` (~3.8B params, ~6GB VRAM)
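- **Phi-4** - `unsloth/Phi-4` (14B params; noticeably larger than the other models in this group, so expect VRAM needs closer to the large models below)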

### Medium Models (Balanced Performance)

**Best for: Moderate VRAM (8-16GB), Better Quality**

- **Qwen3-8B** - `unsloth/Qwen3-8B-Instruct` (8B params, ~12GB VRAM)
- **Qwen2.5-7B** - `unsloth/Qwen2.5-7B-Instruct` (7B params, ~10GB VRAM)
- **Llama 3.1 8B** - `unsloth/Llama-3.1-8B-Instruct-bnb-4bit` (8B params, ~10GB VRAM)
- **Mistral 7B** - `unsloth/Mistral-7B-Instruct-v0.3-bnb-4bit` (7B params, ~10GB VRAM)

### Large Models (Maximum Performance)

**Best for: High VRAM (16GB+), Production Quality**

- **Qwen3-14B** - `unsloth/Qwen3-14B-Instruct` (14B params, ~18GB VRAM)
- **Qwen2.5-14B** - `unsloth/Qwen2.5-14B-Instruct` (14B params, ~18GB VRAM)
- **Llama 3.1 70B** - `unsloth/Llama-3.1-70B-Instruct-bnb-4bit` (70B params, ~40GB VRAM)

### Model Selection Guide

**Choose Small Models (1-4B) if:**
- ✅ Limited GPU memory (4-8GB VRAM)
- ✅ Need fast training iterations
- ✅ Testing/prototyping
- ✅ Simple tasks

**Choose Medium Models (7-8B) if:**
- ✅ Moderate GPU (8-16GB VRAM)
- ✅ Need better quality
- ✅ Complex tasks
- ✅ Production deployment

**Choose Large Models (14B+) if:**
- ✅ High-end GPU (16GB+ VRAM)
- ✅ Maximum quality needed
- ✅ Research/competition

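**Note**: Unsloth publishes pre-quantized 4-bit variants of many models (`-bnb-4bit` suffix) for memory efficiency. Repository names vary between model families and change over time, so confirm the exact identifier on Hugging Face before training.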

**For Judge Models** (RULER scoring):
- Use **Holistic AI Bedrock API** (recommended): `claude-3-5-sonnet`, `claude-3-5-haiku`
- Or OpenAI: `gpt-5-mini`, `gpt-4-turbo`
- Or local models via Ollama

**GPU Requirements**:
- 1-3B models: 4-6GB VRAM
- 4-8B models: 8-12GB VRAM
- 14B models: 16-20GB VRAM
- 70B models: 40GB+ VRAM, even with 4-bit quantization
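A rough way to sanity-check these figures is to compute the memory needed for the weights alone: parameters times bits per parameter. The helper below is a back-of-the-envelope sketch (the function name is illustrative); it deliberately ignores activations, optimizer state, and the KV cache, so actual training needs more than it reports.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Memory for model weights only, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

for name, params in [("1B", 1e9), ("3B", 3e9), ("8B", 8e9), ("14B", 14e9)]:
    fp16 = weight_memory_gb(params, 16)
    int4 = weight_memory_gb(params, 4)
    print(f"{name}: {fp16:.1f} GB at 16-bit, {int4:.1f} GB at 4-bit")
```

For example, a 3B model needs about 6 GB for 16-bit weights but only about 1.5 GB at 4-bit, which is why quantized variants fit on smaller GPUs.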

## Step 3: Simple Example - Math Problem Solver

Let's train an agent to solve math problems. This is a simpler example to understand the concepts.

In [20]:
# Simplified conceptual example (not executable without full ART setup)
# This shows the pattern you'll use

print(" Conceptual Example: Training a Math Agent\n")
print("="*70)

# 1. Define what you want to train
print("\n1️⃣  SETUP")
print("   model = art.TrainableModel(")
print("       name='math-agent',")
print("       base_model='unsloth/Llama-3.2-3B-Instruct'  # Small model (~6GB VRAM). Or Qwen3-4B, Qwen3-14B, etc.")
print("   )")

# 2. Define how the agent works
print("\n2️⃣  ROLLOUT FUNCTION")
print("   async def rollout(model, problem):")
print("       # Agent tries to solve the problem")
print("       response = await model.generate(problem)")
print("       return trajectory  # Record of what happened")

# 3. Score the attempts
print("\n3️⃣  REWARD")
print("   # Option A: Manual reward")
print("   reward = 1.0 if answer_correct else 0.0")
print("   ")
print("   # Option B: RULER (automatic!)")
print("   reward = await ruler_score(trajectory, judge_model='openai/o3')")

# 4. Train!
print("\n4️⃣  TRAINING")
print("   await model.train(trajectory_groups)")
print("   # Model learns: 'What actions led to good rewards?'")

print("\n" + "="*70)
print("\n The agent improves by trying many problems,")
print("   getting rewards, and learning what works!")

# Run training
import asyncio


## Step 3.5: Training Scenarios

Define scenarios to train your agent on:

In [21]:
from pydantic import BaseModel, Field
from typing import List, Optional
from datasets import load_dataset

class TrainingScenario(BaseModel):
    """A single training scenario."""
    id: str
    question: str
    expected_answer: str
    difficulty: str = "medium"  # easy, medium, hard
    subject: Optional[str] = None

# Load HLE dataset for realistic training scenarios
print("Loading HLE dataset for training scenarios...")
hle_dataset = load_dataset('cais/hle', split='test')

# Select diverse HLE questions for training
# Choose questions from different subjects and difficulty levels
selected_indices = [0, 100, 500, 1000, 1500]  # Diverse samples
training_scenarios = []

for idx in selected_indices:
    if idx < len(hle_dataset):
        sample = hle_dataset[idx]
        scenario = TrainingScenario(
            id=f"hle_{idx:04d}",
            question=sample.get('question', ''),
            expected_answer=sample.get('answer', ''),
            difficulty="hard",  # HLE questions are PhD-level
            subject=sample.get('raw_subject', 'Unknown')
        )
        training_scenarios.append(scenario)

print(f"✅ Loaded {len(training_scenarios)} HLE training scenarios")
print(f"\n Sample scenarios:")
for s in training_scenarios[:3]:
    print(f"   [{s.subject}] {s.question[:60]}...")
    print(f"      Expected: {s.expected_answer[:40]}...")



## Step 4: Complete ART Template

Here's a complete template you can adapt for your use case:

In [26]:
# Training Loop - Following tic_tac_toe Pattern

TRAINING_STEPS = 2
ROLLOUTS_PER_STEP = 8
LEARNING_RATE = 5e-5

print("\n" + "=" * 70)
print("üöÄ Starting Training Loop")
print("=" * 70)
print(f"Training steps: {TRAINING_STEPS}")
print(f"Rollouts per step: {ROLLOUTS_PER_STEP}")
print(f"Learning rate: {LEARNING_RATE}")

# Training loop (following tic_tac_toe pattern)
for i in range(await model.get_step(), TRAINING_STEPS):
    print(f"\nüìä Step {i + 1}/{TRAINING_STEPS}")
    
    # Gather trajectory groups with RULER scoring
    train_groups = await art.gather_trajectory_groups(
        (
            art.TrajectoryGroup(
                rollout(model, scenario)
                for _ in range(ROLLOUTS_PER_STEP)
            )
            for scenario in training_scenarios
        ),
        after_each=lambda group: ruler_score_group(
            group,
            "openai/gpt-5-mini",  # Judge model for RULER
            swallow_exceptions=True
        ),
        pbar_desc="gather",
    )
    
    # Delete old checkpoints (like tic_tac_toe)
    await model.delete_checkpoints()
    
    # Train with config (like tic_tac_toe)
    print(f"\nüîÑ Training model (step {i + 1})...")
    await model.train(
        train_groups,
        config=art.TrainConfig(learning_rate=LEARNING_RATE)
    )
    print(f"‚úÖ Step {i + 1} complete!")

print("\n" + "=" * 70)
print("‚úÖ Training Complete!")
print("=" * 70)
print("\nModel weights have been updated via GRPO!")
print("The model should now perform better on HLE questions.")

 Complete ART Training Template

This is a reference template. Adapt for your specific task!

import art
from art.rewards import ruler_score_group

# 1. Initialize trainable model
model = art.TrainableModel(
    name="my-agent",
    project="my-project",
    base_model="unsloth/Llama-3.2-3B-Instruct",  # Small model (~6GB). Or Qwen3-4B, Qwen3-14B, Llama-3.1-8B, etc.
)

# 2. Define your scenarios (what to train on)
scenarios = [
    {"task": "Find information about quantum computing"},
    {"task": "Solve: What is 2+2?"},
    # Add more scenarios...
]

# 3. Define rollout function (how agent interacts)
async def rollout(model: art.Model, scenario: dict) -> art.Trajectory:
    # Get OpenAI-compatible client
    client = model.openai_client()

    # Create trajectory
    trajectory = art.Trajectory(
        messages_and_choices=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": scenario["task"]}
        ]
    )

    # Agent

## Step 4.5: Key RL Concepts for Agents

Before diving into training, understand these critical concepts:

### 1. Trajectory Diversity

**Why it matters**: Multiple rollouts per scenario (`rollouts_per_group`) create diverse trajectories, allowing RULER to compare and rank them.

- **More rollouts** = Better comparison = More reliable rewards
- **Recommended**: 4-8 rollouts per scenario
- **Trade-off**: More rollouts = Slower training but better quality

```python
# Good: Multiple diverse trajectories
art.TrajectoryGroup(
    rollout(model, scenario) for _ in range(8)  # 8 diverse attempts
)
```

### 2. RULER Reward Function

**RULER (Relative Universal LLM-Elicited Rewards)** automatically scores trajectories:

- Compares multiple trajectories for the same scenario
- Uses a judge model (e.g., `claude-3-5-sonnet`, `gpt-5-mini`) to rank them
- Assigns relative scores (0.0-1.0) based on quality
- **No manual reward engineering needed!**

**Best Practices**:
- Use `swallow_exceptions=True` for robust error handling
- Choose a strong judge model (API models recommended)
- Ensure judge model understands your task domain

```python
ruler_score_group(
    group,
    "openai/gpt-5-mini",  # Strong judge model
    swallow_exceptions=True  # Handle errors gracefully
)
```

### 3. Exploration vs Exploitation

**Exploration**: Agent tries different approaches (high temperature, diverse rollouts)
**Exploitation**: Agent uses learned knowledge (lower temperature, focused responses)

- **During Training**: Balance exploration (to find good strategies) and exploitation (to refine them)
- **Rollout Temperature**: Start higher (0.7-0.9) for exploration, lower (0.3-0.5) for exploitation
- **Multiple Rollouts**: Natural exploration mechanism
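To see why temperature controls exploration, it helps to look at how it reshapes a probability distribution over candidate tokens. The sketch below uses a standard softmax with temperature on made-up logits (the values and function names are illustrative assumptions) and measures diversity with entropy.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities; higher temperature flattens the distribution."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats; larger means more diverse sampling."""
    return -sum(p * math.log(p) for p in probs if p > 0)

logits = [2.0, 1.0, 0.1]  # hypothetical scores for three candidate tokens
for t in (0.3, 0.9):
    probs = softmax_with_temperature(logits, t)
    print(f"T={t}: probs={[round(p, 3) for p in probs]}, entropy={entropy(probs):.3f}")
```

At the lower temperature almost all probability mass sits on the top candidate (exploitation); at the higher temperature the alternatives are sampled more often (exploration).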

### 4. Training Stability

**Key Techniques**:

1. **Delete Checkpoints**: Clear old checkpoints before each training step so they do not accumulate on disk
   ```python
   await model.delete_checkpoints()
   ```

2. **Learning Rate**: Start with 5e-5, adjust based on convergence
   - Too high: Unstable training, loss spikes
   - Too low: Slow convergence

3. **Training Steps**: Monitor for convergence
   - Early stopping if rewards plateau
   - Continue if rewards still improving
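One simple way to decide whether rewards have plateaued is to compare the average reward of the most recent steps with the average of the steps just before them. The helper below is a minimal sketch of that idea; the name `rewards_plateaued`, the window size, and the threshold are illustrative assumptions rather than values taken from ART.

```python
def rewards_plateaued(history, window=3, min_improvement=0.01):
    """Return True when the mean reward of the last `window` steps improved
    by less than `min_improvement` over the window before it."""
    if len(history) < 2 * window:
        return False  # not enough data to compare two windows
    recent = sum(history[-window:]) / window
    previous = sum(history[-2 * window:-window]) / window
    return (recent - previous) < min_improvement

# Hypothetical average reward per training step
improving = [0.20, 0.28, 0.35, 0.43, 0.50, 0.58]
flat = [0.20, 0.35, 0.50, 0.61, 0.62, 0.61, 0.62, 0.61, 0.62]
print(rewards_plateaued(improving))
print(rewards_plateaued(flat))
```

Calling this after each training step, and stopping once it returns `True`, avoids spending GPU time on steps that no longer help.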

### 5. Common Pitfalls & Solutions

| Problem | Cause | Solution |
|---------|-------|----------|
| **Reward Hacking** | Model finds shortcuts | Use diverse scenarios, check outputs manually |
| **Low Rewards** | Poor judge model or scenarios | Use stronger judge, verify scenario quality |
| **Training Slow** | Too many rollouts/scenarios | Reduce rollouts, use smaller model |
| **No Improvement** | Learning rate too low/high | Tune learning rate, check reward distribution |
| **GPU OOM** | Model too large | Use smaller model, reduce batch size |

### 6. Multi-Turn Interactions

**For complex tasks**, allow multiple turns:

```python
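async def rollout(model: art.Model, scenario: dict) -> art.Trajectory:
    # MAX_TURNS and task_complete() are placeholders you define for your task
    client = model.openai_client()
    trajectory = art.Trajectory(messages_and_choices=[...], reward=0)

    # Multi-turn loop
    for turn in range(MAX_TURNS):
        response = await client.chat.completions.create(
            model=model.get_inference_name(),
            messages=trajectory.messages(),
        )
        choice = response.choices[0]
        trajectory.messages_and_choices.append(choice)

        # Check if task is complete
        if task_complete(choice.message.content):
            break

        # Add follow-up message
        trajectory.messages_and_choices.append(
            {"role": "user", "content": "Continue..."}
        )

    return trajectory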
```

### 7. Tool Usage in RL

**Training agents to use tools effectively**:

- Include tool calls in trajectories
- RULER judges tool usage quality
- Model learns when and how to use tools

**Example**: Agent learns to call search tool before answering questions

---

**Next**: Apply these concepts in the real training example later in this tutorial!

## Step 5: Advanced Training Tips

For production RL training, consider:

### Training Best Practices

1. **Start Small**
   - Begin with 5-10 scenarios
   - Use 2-4 rollouts per scenario initially
   - Test the pipeline before scaling up
   - Verify RULER scoring works correctly

2. **Monitor Training**
   - Watch reward scores improve over steps
   - Check for reward hacking (model finds shortcuts)
   - Validate on held-out scenarios
   - Monitor GPU memory usage

3. **Iterate Quickly**
   - Add more scenarios gradually
   - Tune learning rate (typically 1e-5 to 5e-5)
   - Adjust rollout count based on GPU memory
   - Use early stopping if rewards plateau

4. **Evaluate Regularly**
   - Test trained model on new scenarios
   - Compare to baseline (prompted model)
   - Use HLE benchmark (Tutorial 6) for evaluation
   - Check for overfitting (good on train, bad on test)
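Checking for overfitting requires scenarios the model never trained on. A minimal sketch of a reproducible train/held-out split is shown below; `split_scenarios` and the 20% hold-out fraction are illustrative choices, not part of ART.

```python
import random

def split_scenarios(scenarios, holdout_fraction=0.2, seed=42):
    """Shuffle once with a fixed seed, then reserve a fraction for evaluation."""
    shuffled = list(scenarios)
    random.Random(seed).shuffle(shuffled)
    n_holdout = max(1, int(len(shuffled) * holdout_fraction))
    return shuffled[n_holdout:], shuffled[:n_holdout]

scenario_ids = [f"scenario_{i:02d}" for i in range(10)]
train_ids, holdout_ids = split_scenarios(scenario_ids)
print(f"train={len(train_ids)}, held-out={len(holdout_ids)}")
```

Train only on the first list and report metrics on the second; a large gap between the two suggests the model memorized the training scenarios.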

### Key Hyperparameters

- **Learning Rate**: 1e-5 to 5e-5 (start with 5e-5)
  - Too high: Unstable, loss spikes
  - Too low: Slow convergence
  - Adjust based on reward improvement rate

- **Rollouts per Scenario**: 4-8 (more = better but slower)
  - More rollouts = Better RULER comparison
  - Trade-off: Quality vs Speed
  - Start with 4, increase if needed

- **Training Steps**: 2-10 (depends on convergence)
  - Monitor reward improvement
  - Stop if rewards plateau
  - Continue if still improving

- **Temperature**: 0.7-0.9 during training (exploration)
  - Higher = More diverse trajectories
  - Lower = More focused responses
  - Balance exploration vs exploitation

- **Batch Size**: Auto-configured by ART (usually optimal)
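Keeping these settings in one place makes experiments easier to compare. The sketch below collects them in a small dataclass and flags values outside the ranges suggested above; `RLConfig` and its `check` method are hypothetical helpers for illustration, not ART classes, and the ranges are this tutorial's rules of thumb rather than hard limits.

```python
from dataclasses import dataclass
from typing import List

@dataclass(frozen=True)
class RLConfig:
    learning_rate: float = 5e-5
    rollouts_per_scenario: int = 4
    training_steps: int = 2
    temperature: float = 0.8

    def check(self) -> List[str]:
        """Return warnings for values outside the tutorial's suggested ranges."""
        warnings = []
        if not 1e-5 <= self.learning_rate <= 5e-5:
            warnings.append("learning_rate outside 1e-5 to 5e-5")
        if not 4 <= self.rollouts_per_scenario <= 8:
            warnings.append("rollouts_per_scenario outside 4 to 8")
        if not 2 <= self.training_steps <= 10:
            warnings.append("training_steps outside 2 to 10")
        if not 0.7 <= self.temperature <= 0.9:
            warnings.append("temperature outside 0.7 to 0.9")
        return warnings

print(RLConfig().check())
print(RLConfig(learning_rate=1e-3, rollouts_per_scenario=2).check())
```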

### Troubleshooting

**Low Rewards**: 
- Increase rollouts per scenario (better RULER comparison)
- Check RULER judge model quality (use stronger model)
- Verify scenarios are appropriate and clear
- Check if judge model understands your task domain

**Training Slow**:
- Reduce rollouts per scenario
- Use smaller base model
- Check GPU utilization (should be high)
- Reduce number of scenarios per batch

**Model Not Improving**:
- Increase training steps (may need more iterations)
- Adjust learning rate (try 1e-5 or 2e-5)
- Verify reward function (RULER) is working correctly
- Check if scenarios are too diverse (may need more focused set)

**Reward Hacking**:
- Model finds shortcuts instead of solving task
- **Solution**: Add more diverse scenarios, manually check outputs
- Use stronger judge model
- Add explicit constraints in system prompt

**GPU Out of Memory**:
- Use smaller model (1-3B instead of 7-14B)
- Reduce batch size (if configurable)
- Use gradient checkpointing
- Reduce sequence length

### Evaluation Metrics

Track these metrics to measure improvement:

1. **Average Reward**: Should increase over training steps
2. **Success Rate**: Percentage of scenarios solved correctly
3. **Tool Usage Accuracy**: If using tools, how often used correctly
4. **Response Quality**: Manual evaluation of outputs
5. **Baseline Comparison**: Compare to prompted model performance
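The first two metrics and the baseline comparison are straightforward to compute once you have per-scenario rewards for both models. The sketch below uses made-up reward lists purely for illustration; `summarize` and the 0.5 success threshold are assumptions you would adapt to your own reward scale.

```python
def summarize(rewards, success_threshold=0.5):
    """Average reward and success rate for one model's evaluation run."""
    avg = sum(rewards) / len(rewards)
    successes = sum(1 for r in rewards if r >= success_threshold)
    return {"average_reward": avg, "success_rate": successes / len(rewards)}

# Hypothetical per-scenario rewards on the same held-out scenarios
baseline_rewards = [0.2, 0.6, 0.4, 0.7, 0.3]
trained_rewards = [0.5, 0.8, 0.4, 0.9, 0.6]

baseline = summarize(baseline_rewards)
trained = summarize(trained_rewards)
for metric in baseline:
    delta = trained[metric] - baseline[metric]
    print(f"{metric}: baseline={baseline[metric]:.2f} "
          f"trained={trained[metric]:.2f} delta={delta:+.2f}")
```

Evaluating both models on the same scenarios keeps the comparison fair; differences on different scenario sets may only reflect scenario difficulty.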

### Next Steps After Training

1. **Save Model**: Trained model is automatically saved by ART backend
2. **Test on New Scenarios**: Verify generalization
3. **Deploy**: Use `model.openai_client()` for inference
4. **Iterate**: Add more scenarios, retrain if needed

## Step 7: Evaluating Trained Agents

How do you know if RL training worked?

In [None]:
# Real ART Training - Complete Runnable Example
# Following tic_tac_toe pattern with backend options

import art
from art.rewards import ruler_score_group
from datasets import load_dataset
from dotenv import load_dotenv
from pathlib import Path
import os
import asyncio
import random
from openai import AsyncOpenAI

# Load environment
load_dotenv(Path('../.env'))

# Choose backend based on availability
OPENPIPE_API_KEY = os.getenv('OPENPIPE_API_KEY')

if OPENPIPE_API_KEY:
    print("✅ Using OpenPipe Cloud Backend (no GPU needed)")
    backend = None  # Cloud backend auto-detected
else:
    print("⚠️  Using LocalBackend (requires NVIDIA GPU)")
    print("   For macOS without GPU, set OPENPIPE_API_KEY in .env")
    try:
        from art.local import LocalBackend
        backend = LocalBackend(path="./.art")
    except ImportError:
        print("❌ LocalBackend not available")
        print("   Install: pip install openpipe-art[backend]")
        backend = None

# Load HLE scenarios
hle_dataset = load_dataset('cais/hle', split='test')
selected_indices = [0, 100, 500]
training_scenarios = []

for idx in selected_indices:
    if idx < len(hle_dataset):
        sample = hle_dataset[idx]
        training_scenarios.append({
            'id': f"hle_{idx:04d}",
            'question': sample.get('question', ''),
            'expected_answer': sample.get('answer', ''),
            'subject': sample.get('raw_subject', 'Unknown')
        })

print(f"✅ Loaded {len(training_scenarios)} HLE scenarios")

# Initialize model
random.seed(42)
model = art.TrainableModel(
    name="hle-agent",
    project="hackathon-rl-training",
    base_model="unsloth/Llama-3.2-3B-Instruct",
)

print("✅ TrainableModel created")
print(f"   Name: {model.name}")
print(f"   Project: {model.project}")
print(f"   Base model: unsloth/Llama-3.2-3B-Instruct")


In [None]:
# Register Model and Define Rollout Function
# Following tic_tac_toe pattern

async def register_and_train():
    # Register model
    print("\n1. Registering model...")
    try:
        if backend:
            await model.register(backend)
        else:
            # Cloud backend (auto-detected from OPENPIPE_API_KEY)
            await model.register()
        print("✅ Model registered!")
    except Exception as e:
        print(f"❌ Registration failed: {e}")
        print("\n💡 Backend Setup:")
        if not OPENPIPE_API_KEY:
            print("   Option 1: Set OPENPIPE_API_KEY in .env (recommended for macOS)")
            print("   Option 2: Use LocalBackend with NVIDIA GPU")
        return
    
    # Define rollout function (following tic_tac_toe pattern)
    async def rollout(scenario: dict) -> art.Trajectory:
        trajectory = art.Trajectory(
            messages_and_choices=[
                {
                    "role": "system",
                    "content": "You are answering a question from Humanity's Last Exam (HLE), a PhD-level academic benchmark. Provide your best answer."
                },
                {
                    "role": "user",
                    "content": scenario['question']
                }
            ],
            reward=0,
        )
        
        try:
            # Use model's inference API (like tic_tac_toe)
            client = AsyncOpenAI(
                base_url=model.inference_base_url,
                api_key=model.inference_api_key,
            )
            
            chat_completion = await client.chat.completions.create(
                model=model.get_inference_name(),
                messages=trajectory.messages(),
                max_completion_tokens=512,
            )
            
            choice = chat_completion.choices[0]
            content = choice.message.content
            assert isinstance(content, str)
            trajectory.messages_and_choices.append(choice)
            
        except Exception as e:
            print(f"  ⚠️  Rollout error: {e}")
            trajectory.reward = -1
        
        return trajectory
    
    # Training loop (following tic_tac_toe pattern)
    print("\n2. Gathering trajectory groups with RULER scoring...")
    
    TRAINING_STEPS = 2
    ROLLOUTS_PER_SCENARIO = 4  # rollouts gathered for each scenario at every step
    LEARNING_RATE = 5e-5
    
    print("\nTraining configuration:")
    print(f"  Steps: {TRAINING_STEPS}")
    print(f"  Rollouts per scenario: {ROLLOUTS_PER_SCENARIO}")
    print(f"  Learning rate: {LEARNING_RATE}")
    
    for i in range(await model.get_step(), TRAINING_STEPS):
        print(f"\n📊 Step {i + 1}/{TRAINING_STEPS}")
        
        # Gather trajectory groups with RULER scoring
        train_groups = await art.gather_trajectory_groups(
            (
                art.TrajectoryGroup(
                    rollout(scenario)
                    for _ in range(ROLLOUTS_PER_SCENARIO)
                )
                for scenario in training_scenarios
            ),
            after_each=lambda group: ruler_score_group(
                group,
                "openai/gpt-5-mini",  # Judge model
                swallow_exceptions=True
            ),
            pbar_desc="gather",
        )
        
        # Delete old checkpoints
        await model.delete_checkpoints()
        
        # Train with config
        print(f"🔄 Training model (step {i + 1})...")
        await model.train(
            train_groups,
            config=art.TrainConfig(learning_rate=LEARNING_RATE)
        )
        print(f"✅ Step {i + 1} complete!")
    
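    print("\n" + "=" * 70)
    print("✅ Training Complete!")
    print("=" * 70)
    print("\nModel weights have been updated via GRPO!")
    print("Evaluate on held-out HLE questions to check whether performance improved.")

# Run training.
# A notebook already runs an event loop, so await the coroutine directly.
# In a plain Python script, use asyncio.run(register_and_train()) instead.
await register_and_train()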


In [27]:
# Illustrative before/after figures for each evaluation strategy.
# They are examples, not results measured in this notebook.
print(" Evaluation Strategies\n")
print("="*70)

print("\n1️⃣  BENCHMARK COMPARISON")
print("   Before Training:  65% success rate")
print("   After Training:   92% success rate")
print("   Improvement:      +27 percentage points!")

print("\n2️⃣  TOOL USAGE ANALYSIS")
print("   Before: Uses search incorrectly 40% of time")
print("   After:  Uses search correctly 95% of time")

print("\n3️⃣  MULTI-TURN PERFORMANCE")
print("   Before: Gives up after 2 failed attempts")
print("   After:  Tries alternative approaches, self-corrects")

print("\n4️⃣  HUMAN EVALUATION")
print("   Before: 'Sometimes helpful'")
print("   After:  'Reliably solves complex tasks'")

print("\n5️⃣  ABLATION STUDIES")
print("   Test: What if we remove RL training?")
print("   Result: Performance drops back to baseline")
print("   Conclusion: RL training is working!")

print("\n" + "="*70)
print("\n Use the HLE benchmark (06_benchmark_evaluation.ipynb) to test!")

 Evaluation Strategies


1️⃣  BENCHMARK COMPARISON
   Before Training:  65% success rate
   After Training:   92% success rate
   Improvement:      +27 percentage points!

2️⃣  TOOL USAGE ANALYSIS
   Before: Uses search incorrectly 40% of time
   After:  Uses search correctly 95% of time

3️⃣  MULTI-TURN PERFORMANCE
   Before: Gives up after 2 failed attempts
   After:  Tries alternative approaches, self-corrects

4️⃣  HUMAN EVALUATION
   Before: 'Sometimes helpful'
   After:  'Reliably solves complex tasks'

5️⃣  ABLATION STUDIES
   Test: What if we remove RL training?
   Result: Performance drops back to baseline
   Conclusion: RL training is working!


 Use the HLE benchmark (06_hle_benchmark.ipynb) to test!
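
The benchmark comparison in strategy 1 can be computed from per-question pass/fail results. The sketch below is a minimal illustration of that calculation. `success_rate` and `compare_runs` are hypothetical helpers, and the two result lists are made-up placeholders, not measurements from any run.

```python
from typing import Dict, Sequence


def success_rate(results: Sequence[bool]) -> float:
    """Fraction of questions answered correctly (0.0 for an empty run)."""
    return sum(results) / len(results) if results else 0.0


def compare_runs(before: Sequence[bool], after: Sequence[bool]) -> Dict[str, float]:
    """Compare two runs over the same questions, given in the same order."""
    if len(before) != len(after):
        raise ValueError("both runs must cover the same questions")
    # Count questions that flipped from wrong to right, and the reverse.
    fixed = sum(1 for b, a in zip(before, after) if not b and a)
    regressed = sum(1 for b, a in zip(before, after) if b and not a)
    return {
        "before": success_rate(before),
        "after": success_rate(after),
        "delta_points": 100 * (success_rate(after) - success_rate(before)),
        "fixed": fixed,
        "regressed": regressed,
    }


# Placeholder results (True = correct answer), one entry per question.
baseline = [True, False, False, True, False]
trained = [True, True, False, True, True]

report = compare_runs(baseline, trained)
print(f"Before: {report['before']:.0%}  After: {report['after']:.0%}")
print(f"Change: {report['delta_points']:+.0f} percentage points "
      f"({report['fixed']} fixed, {report['regressed']} regressed)")
```

Reporting regressions alongside the headline rate helps on small evaluation sets, where a net gain can hide questions that the trained model now gets wrong.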


## Summary

### What You've Learned

✅ **RL Concepts** - Understanding reinforcement learning for agents  
✅ **Training Patterns** - How RL training works conceptually  
✅ **RULER Rewards** - Automatic reward functions  
✅ **When to Use RL** - Resource requirements and trade-offs  
✅ **Alternative Approaches** - How to improve agents without RL training  

### Key Takeaways

#### RL Training: Powerful but Resource-Intensive

**RL Benefits:**
- Can improve success rates by 15-30%
- Better tool usage and multi-turn performance
- Specialized for specific tasks

**RL Requirements:**
- ⚠️ GPU resources (8GB+ VRAM)
- ⚠️ Hours of training time
- ⚠️ Complex setup (ART backend)
- ⚠️ Not practical for 48-hour hackathons

#### For Hackathon Projects: Use Prompting + Optimization

**Recommended Approach:**
1. **Better Prompts** (Tutorial 03) - Quick improvements
2. **Structured Output** (Tutorial 03) - More reliable
3. **Custom Tools** (Tutorial 02) - Extend capabilities
4. **Monitoring** (Tutorial 04) - Track performance
5. **Observability** (Tutorial 05) - Debug issues
6. **Testing** (Tutorials 06, 08) - Validate improvements

**These methods work well within hackathon timeframes!**

### When to Consider RL Training

**After the Hackathon:**
- If you have GPU resources
- If you want to specialize your agent further
- If you have weeks/months for training
- If you need 90%+ success rates

**During the Hackathon:**
- Focus on prompting and optimization
- Use monitoring to identify improvements
- Test thoroughly with benchmarks
- Iterate quickly on prompts and tools

### Resources

- **RL Concepts**: This tutorial (conceptual learning)
- **Prompting**: Tutorials 01-03
- **Optimization**: Tutorials 04-05
- **Testing**: Tutorials 06, 08
- **Documentation**: [ART Documentation](https://docs.openpipe.ai/art), [Unsloth RL Guide](https://docs.unsloth.ai/get-started/reinforcement-learning-rl-guide/training-ai-agents-with-rl)

### Next Steps

1. **For Hackathon**: Focus on Tutorials 01-06, 08
2. **After Hackathon**: Consider RL training if you have resources
3. **Learning**: Use this tutorial to understand RL concepts

---

**Remember**: RL training is powerful but not required for hackathon success.  
**Focus on**: Better prompts, tools, monitoring, and testing! 🚀

## Step 6: Real RL Training with ART Backend

**This section shows how to run REAL RL training with actual model weight updates.**

### Prerequisites

1. **ART Backend Server**: Must be running
   ```bash
   art serve
   ```

2. **GPU**: An NVIDIA GPU is required for LocalBackend; the cloud backend needs no local GPU

3. **Model Available**: Base model must be on Hugging Face

### Setup ART Backend

ART requires a backend server to handle model training. You have two options:

**Option 1: Local Backend (requires NVIDIA GPU)**
```bash
pip install "openpipe-art[backend]"
art serve
```

**Option 2: OpenPipe Cloud Backend**
- Requires OpenPipe API key
- Set `OPENPIPE_API_KEY` environment variable

### Real Training Code

The code below will:
1. Create a `TrainableModel`
2. Register it with the backend
3. Run rollouts and collect trajectories
4. Score with RULER
5. **Actually update model weights via GRPO**

### Backend Options

**Option 1: LocalBackend (Requires NVIDIA GPU)**
- Uses vLLM for model serving
- Requires CUDA (NVIDIA GPU)
- MPS (macOS GPU) support is limited in vLLM
- Best for: Linux/Windows with NVIDIA GPU

**Option 2: OpenPipe Cloud Backend (Recommended for macOS)**
- No local GPU needed
- Works on macOS, Linux, Windows
- Set `OPENPIPE_API_KEY` in `.env`
- Best for: Hackathon participants

**Note**: If you're on macOS without NVIDIA GPU, use OpenPipe Cloud Backend instead of LocalBackend.
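
The choice between the two options can be sketched as a small pre-flight check. This is an illustration only and not part of ART: `gpu_memory_mib` and `pick_backend` are hypothetical helpers, the returned labels are arbitrary strings, and the 8 GiB threshold is an assumption taken from the VRAM figure quoted in this tutorial. GPU detection shells out to `nvidia-smi`, so it reports no GPU on machines without NVIDIA drivers.

```python
import os
import shutil
import subprocess
from typing import List


def gpu_memory_mib() -> List[int]:
    """Total memory in MiB of each NVIDIA GPU, or [] if none is detected."""
    if shutil.which("nvidia-smi") is None:
        return []
    try:
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=memory.total",
             "--format=csv,noheader,nounits"],
            capture_output=True, text=True, check=True, timeout=10,
        ).stdout
    except (subprocess.SubprocessError, OSError):
        return []
    # One number per line, one line per GPU.
    return [int(line) for line in out.splitlines() if line.strip().isdigit()]


def pick_backend(has_api_key: bool, gpus_mib: List[int],
                 min_mib: int = 8 * 1024) -> str:
    """Return 'cloud', 'local', or 'none' (hypothetical labels)."""
    if has_api_key:
        return "cloud"   # an API key is set, so no local GPU is needed
    if any(mem >= min_mib for mem in gpus_mib):
        return "local"   # at least one GPU meets the assumed VRAM threshold
    return "none"        # neither option is available on this machine


choice = pick_backend(bool(os.getenv("OPENPIPE_API_KEY")), gpu_memory_mib())
print(f"Suggested backend: {choice}")
```

Checking the API key first mirrors the recommendation above: when a key is present, the cloud option is preferred even on machines that also have a GPU.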


In [None]:
# Real ART Training - Complete Runnable Example
# Following tic_tac_toe pattern with backend options

import art
from art.rewards import ruler_score_group
from datasets import load_dataset
from dotenv import load_dotenv
from pathlib import Path
import os
import asyncio
import random
from openai import AsyncOpenAI

# Load environment
load_dotenv(Path('../.env'))

# Choose backend based on availability
OPENPIPE_API_KEY = os.getenv('OPENPIPE_API_KEY')

if OPENPIPE_API_KEY:
    print("✅ Using OpenPipe Cloud Backend (no GPU needed)")
    backend = None  # Cloud backend auto-detected
else:
    print("⚠️  Using LocalBackend (requires NVIDIA GPU)")
    print("   For macOS without NVIDIA GPU, set OPENPIPE_API_KEY in .env")
    try:
        from art.local import LocalBackend
        backend = LocalBackend(path="./.art")
    except ImportError:
        print("❌ LocalBackend not available")
        print("   Install: pip install openpipe-art[backend]")
        backend = None

# Load HLE scenarios
hle_dataset = load_dataset('cais/hle', split='test')
selected_indices = [0, 100, 500]
training_scenarios = []

for idx in selected_indices:
    if idx < len(hle_dataset):
        sample = hle_dataset[idx]
        training_scenarios.append({
            'id': f"hle_{idx:04d}",
            'question': sample.get('question', ''),
            'expected_answer': sample.get('answer', ''),
            'subject': sample.get('raw_subject', 'Unknown')
        })

print(f"✅ Loaded {len(training_scenarios)} HLE scenarios")

# Initialize model
random.seed(42)
model = art.TrainableModel(
    name="hle-agent",
    project="hackathon-rl-training",
    base_model="unsloth/Llama-3.2-3B-Instruct",
)

print("✅ TrainableModel created")

# Register and Train Function
async def register_and_train():
    # Register model
    print("\n1. Registering model...")
    try:
        if backend:
            await model.register(backend)
        else:
            await model.register()  # Cloud backend
        print("✅ Model registered!")
    except Exception as e:
        print(f"❌ Registration failed: {e}")
        print("\n💡 Backend Setup:")
        if not OPENPIPE_API_KEY:
            print("   Option 1: Set OPENPIPE_API_KEY in .env (recommended for macOS)")
            print("   Option 2: Use LocalBackend with NVIDIA GPU")
        return
    
    # Define rollout function (following tic_tac_toe pattern)
    async def rollout(scenario: dict) -> art.Trajectory:
        trajectory = art.Trajectory(
            messages_and_choices=[
                {
                    "role": "system",
                    "content": "You are answering a question from Humanity's Last Exam (HLE), a PhD-level academic benchmark. Provide your best answer."
                },
                {
                    "role": "user",
                    "content": scenario['question']
                }
            ],
            reward=0,
        )
        
        try:
            client = AsyncOpenAI(
                base_url=model.inference_base_url,
                api_key=model.inference_api_key,
            )
            
            chat_completion = await client.chat.completions.create(
                model=model.get_inference_name(),
                messages=trajectory.messages(),
                max_completion_tokens=512,
            )
            
            choice = chat_completion.choices[0]
            content = choice.message.content
            assert isinstance(content, str)
            trajectory.messages_and_choices.append(choice)
        except Exception as e:
            print(f"  ⚠️  Rollout error: {e}")
            trajectory.reward = -1
        
        return trajectory
    
    # Training loop (following tic_tac_toe pattern)
    print("\n2. Gathering trajectory groups with RULER scoring...")
    
    TRAINING_STEPS = 2
    ROLLOUTS_PER_SCENARIO = 4  # rollouts gathered for each scenario at every step
    LEARNING_RATE = 5e-5
    
    print("\nTraining configuration:")
    print(f"  Steps: {TRAINING_STEPS}")
    print(f"  Rollouts per scenario: {ROLLOUTS_PER_SCENARIO}")
    print(f"  Learning rate: {LEARNING_RATE}")
    
    for i in range(await model.get_step(), TRAINING_STEPS):
        print(f"\n📊 Step {i + 1}/{TRAINING_STEPS}")
        
        train_groups = await art.gather_trajectory_groups(
            (
                art.TrajectoryGroup(
                    rollout(scenario)
                    for _ in range(ROLLOUTS_PER_SCENARIO)
                )
                for scenario in training_scenarios
            ),
            after_each=lambda group: ruler_score_group(
                group,
                "openai/gpt-5-mini",
                swallow_exceptions=True
            ),
            pbar_desc="gather",
        )
        
        await model.delete_checkpoints()
        
        print(f"🔄 Training model (step {i + 1})...")
        await model.train(
            train_groups,
            config=art.TrainConfig(learning_rate=LEARNING_RATE)
        )
        print(f"✅ Step {i + 1} complete!")
    
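    print("\n" + "=" * 70)
    print("✅ Training Complete!")
    print("=" * 70)

# Run training.
# A notebook already runs an event loop, so await the coroutine directly.
# In a plain Python script, use asyncio.run(register_and_train()) instead.
await register_and_train()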


### What Happens During Real Training?

1. **Model Registration**: Connects to ART backend server
2. **Rollout Collection**: Agent runs on scenarios, collects trajectories
3. **RULER Scoring**: An LLM judge scores each trajectory relative to the others in its group (0.0-1.0)
4. **GRPO Training**: Model weights are updated based on scores
5. **Weight Updates**: Actual model parameters change
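
Step 4 depends on how trajectories compare within a group, not on their absolute scores. The sketch below shows the group-relative advantage that GRPO is built around: each reward is centred on the group mean and scaled by the group's standard deviation. It is a simplified illustration, not ART's implementation. `group_advantages` and the sample scores are hypothetical, and using the population standard deviation is an assumption.

```python
from statistics import mean, pstdev
from typing import List


def group_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one trajectory group.

    Positive values mark rollouts that beat the group average (reinforced);
    negative values mark rollouts that fell below it (discouraged).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; eps avoids division by zero
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical judge scores for four rollouts of the same scenario.
ruler_scores = [0.9, 0.6, 0.3, 0.6]
for score, adv in zip(ruler_scores, group_advantages(ruler_scores)):
    print(f"reward={score:.1f}  advantage={adv:+.2f}")

# When every rollout gets the same score, all advantages are zero,
# so that group contributes no learning signal.
print(group_advantages([0.5, 0.5, 0.5, 0.5]))
```

This is why several rollouts are gathered per scenario: a single rollout has nothing to be compared against, and a group with identical scores gives the optimizer no direction.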

### Key Differences from Simulation

| Simulation (Previous) | Real Training (This) |
|----------------------|----------------------|
| Improved prompts | Actual weight updates |
| No backend needed | Requires ART backend |
| No model changes | Model weights change |
| Fast iteration | Slower (GPU recommended) |

### Troubleshooting

**Backend Not Running**:
```bash
art serve
```

**Model Not Found**:
- Check if model exists on Hugging Face
- Try a different base_model

**GPU Out of Memory**:
- Use a smaller model
- Reduce batch size
- Use gradient checkpointing