# 06 — Infrastructure & Frameworks for RL Training

> **Purpose:** Understand the production stack for training LLMs with RL. This notebook covers the key components without requiring actual distributed infrastructure.

**The Modern RL4LLM Stack:**

| Component | Purpose | Tool |
|-----------|---------|------|
| Distributed Orchestration | Coordinate Actor/Critic/Reward models | **Ray** |
| Fast Generation | Accelerate sample generation | **vLLM** |
| Memory Optimization | Train 70B+ models | **DeepSpeed ZeRO** |
| Training Framework | End-to-end RLHF/GRPO | **OpenRLHF** / **TRL** |

---

In [1]:
# Note: This notebook demonstrates concepts and configurations.
# Actual distributed training requires a Ray cluster and GPUs.

import torch
import json
from typing import Dict, List, Any

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")

PyTorch version: 2.9.0+cpu
CUDA available: False


## 1. The RLHF Bottleneck Problem

**Key insight:** Generation dominates the RL training loop, typically consuming around 80% of wall-clock time per step.

```
Time Breakdown:
┌────────────────────────────────────────────────────────────┐
│████████████████████████████████████████████████░░░░░░░░░░░░│
│           GENERATION (80%)           │TRAIN(15%)│RWD(5%)│
└────────────────────────────────────────────────────────────┘
```

This is why a fast inference engine such as vLLM is critical: speeding up generation shrinks the largest slice of every training step.

In [2]:
def estimate_rlhf_time_breakdown(
    num_prompts: int,
    group_size: int,
    avg_response_tokens: int,
    tokens_per_second: float,  # Generation throughput
    train_time_per_batch: float,  # Seconds
) -> Dict[str, float]:
    """
    Estimate time breakdown for RLHF training step.
    
    Args:
        num_prompts: Number of prompts per batch
        group_size: Responses per prompt (K in GRPO)
        avg_response_tokens: Average response length
        tokens_per_second: Generation throughput (illustrative: ~100 for HF generate, ~2400 for vLLM)
        train_time_per_batch: Time for gradient update
    
    Returns:
        Time breakdown dict
    """
    total_responses = num_prompts * group_size
    total_tokens = total_responses * avg_response_tokens
    
    gen_time = total_tokens / tokens_per_second
    reward_time = total_responses * 0.01  # Assume 10ms per reward
    total_time = gen_time + train_time_per_batch + reward_time
    
    return {
        'generation_sec': gen_time,
        'training_sec': train_time_per_batch,
        'reward_sec': reward_time,
        'total_sec': total_time,
        'generation_pct': gen_time / total_time * 100,
        'training_pct': train_time_per_batch / total_time * 100,
    }

In [3]:
# Compare HuggingFace vs vLLM generation speed

params = dict(
    num_prompts=32,
    group_size=8,
    avg_response_tokens=500,
    train_time_per_batch=5.0,
)

hf_breakdown = estimate_rlhf_time_breakdown(**params, tokens_per_second=100)
vllm_breakdown = estimate_rlhf_time_breakdown(**params, tokens_per_second=2400)

print("HuggingFace Transformers (100 tok/s):")
print(f"  Generation: {hf_breakdown['generation_sec']:.1f}s ({hf_breakdown['generation_pct']:.1f}%)")
print(f"  Training: {hf_breakdown['training_sec']:.1f}s ({hf_breakdown['training_pct']:.1f}%)")
print(f"  Total: {hf_breakdown['total_sec']:.1f}s")

print("\nvLLM (2400 tok/s = 24x faster):")
print(f"  Generation: {vllm_breakdown['generation_sec']:.1f}s ({vllm_breakdown['generation_pct']:.1f}%)")
print(f"  Training: {vllm_breakdown['training_sec']:.1f}s ({vllm_breakdown['training_pct']:.1f}%)")
print(f"  Total: {vllm_breakdown['total_sec']:.1f}s")

speedup = hf_breakdown['total_sec'] / vllm_breakdown['total_sec']
print(f"\nOverall speedup: {speedup:.1f}x")

HuggingFace Transformers (100 tok/s):
  Generation: 1280.0s (99.4%)
  Training: 5.0s (0.4%)
  Total: 1287.6s

vLLM (2400 tok/s = 24x faster):
  Generation: 53.3s (87.6%)
  Training: 5.0s (8.2%)
  Total: 60.9s

Overall speedup: 21.1x
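The gap between the 24x generation speedup and the 21.1x overall speedup is Amdahl's law at work: training and reward scoring are not accelerated, so they cap the total gain. A minimal sketch, reusing the toy numbers from the cell above (the helper name `amdahl_speedup` is illustrative and not part of any library):

```python
def amdahl_speedup(accelerated_fraction: float, factor: float) -> float:
    """Overall speedup when only `accelerated_fraction` of the runtime becomes `factor`x faster."""
    if not 0.0 <= accelerated_fraction <= 1.0:
        raise ValueError("accelerated_fraction must be in [0, 1]")
    if factor <= 0:
        raise ValueError("factor must be positive")
    return 1.0 / ((1.0 - accelerated_fraction) + accelerated_fraction / factor)


# Toy numbers from the HuggingFace baseline above (seconds per step)
gen_sec, train_sec, reward_sec = 1280.0, 5.0, 2.56
gen_fraction = gen_sec / (gen_sec + train_sec + reward_sec)

overall = amdahl_speedup(gen_fraction, factor=24.0)
ceiling = 1.0 / (1.0 - gen_fraction)  # limit as generation time goes to zero

print(f"Generation fraction: {gen_fraction:.3f}")
print(f"Overall speedup at 24x faster generation: {overall:.1f}x")
print(f"Upper bound with infinitely fast generation: {ceiling:.0f}x")
```

Once generation is fast, the remaining training and reward time becomes the next bottleneck, which is what the memory and orchestration sections below address.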


## 2. vLLM: PagedAttention for Fast Generation

**Key innovation:** Paging for KV cache reduces memory waste from 60-80% to <4%.

```
Traditional KV Cache:              vLLM PagedAttention:
┌────────────────────┐            ┌──┬──┬──┬──┬──┬──┐
│ Allocated for max  │            │P1│P1│P2│P3│P3│P3│  <- Blocks
│ possible length    │            └──┴──┴──┴──┴──┴──┘
│   (WASTED!)        │              Allocated on-demand
│                    │              Shared across requests
└────────────────────┘
```
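To make the diagram concrete, here is a toy comparison of the two allocation strategies. It models only the bookkeeping (how many KV slots get reserved), not vLLM's actual allocator or attention kernel; the function names, the block size of 16, and the sequence lengths are all illustrative assumptions.

```python
import math
from typing import List


def preallocated_slots(seq_lens: List[int], max_len: int) -> int:
    """Traditional scheme: reserve max_len KV slots for every sequence up front."""
    return len(seq_lens) * max_len


def paged_slots(seq_lens: List[int], block_size: int) -> int:
    """Paged scheme: reserve whole blocks on demand, so at most one partial block per sequence."""
    return sum(math.ceil(n / block_size) * block_size for n in seq_lens)


def waste_fraction(allocated: int, used: int) -> float:
    """Fraction of the allocated slots that hold no token."""
    return 1.0 - used / allocated


seq_lens = [100, 37, 512]  # tokens actually stored per sequence
used = sum(seq_lens)

pre = preallocated_slots(seq_lens, max_len=2048)
paged = paged_slots(seq_lens, block_size=16)

print(f"Preallocated: {pre} slots, waste {waste_fraction(pre, used):.1%}")
print(f"Paged:        {paged} slots, waste {waste_fraction(paged, used):.1%}")
```

With paging, waste is bounded by one partially filled block per sequence, which is why it stays small regardless of how far the actual lengths fall below the maximum.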

In [4]:
def estimate_kv_cache_memory(
    batch_size: int,
    seq_len: int,
    num_layers: int,
    num_heads: int,
    head_dim: int,
    dtype_bytes: int = 2,  # FP16
) -> Dict[str, float]:
    """
    Estimate KV cache memory requirements.
    
    Memory = 2 * batch * seq * layers * heads * head_dim * dtype
    (2x for K and V)
    """
    kv_memory = 2 * batch_size * seq_len * num_layers * num_heads * head_dim * dtype_bytes
    kv_memory_gb = kv_memory / (1024 ** 3)
    
    return {
        'kv_cache_bytes': kv_memory,
        'kv_cache_gb': kv_memory_gb,
    }


# Example: Llama-7B style model
kv_estimate = estimate_kv_cache_memory(
    batch_size=32,
    seq_len=2048,
    num_layers=32,
    num_heads=32,
    head_dim=128,
)
print("KV Cache for batch=32, seq=2048:")
print(f"  Memory: {kv_estimate['kv_cache_gb']:.2f} GB")
# Waste is a fraction of the *allocated* memory, so allocated = useful / (1 - waste)
print(f"  With 60% waste (traditional): {kv_estimate['kv_cache_gb'] / (1 - 0.60):.2f} GB")
print(f"  With 4% waste (vLLM): {kv_estimate['kv_cache_gb'] / (1 - 0.04):.2f} GB")

KV Cache for batch=32, seq=2048:
  Memory: 32.00 GB
  With 60% waste (traditional): 80.00 GB
  With 4% waste (vLLM): 33.33 GB


### vLLM Configuration for RLHF

In [5]:
# vLLM configuration for OpenRLHF integration

vllm_config = {
    # Engine settings
    "model": "meta-llama/Llama-2-7b-hf",
    "tensor_parallel_size": 1,  # GPUs per model copy
    "dtype": "bfloat16",
    "seed": 42,
    
    # Memory management
    "gpu_memory_utilization": 0.9,  # Use 90% of GPU memory
    "max_model_len": 4096,  # Maximum sequence length
    "enable_prefix_caching": True,  # Share prefixes across requests
    
    # Scheduling
    "max_num_seqs": 256,  # Concurrent sequences
    "max_num_batched_tokens": 8192,  # Tokens per batch
}

print("vLLM Configuration for RLHF:")
print(json.dumps(vllm_config, indent=2))

vLLM Configuration for RLHF:
{
  "model": "meta-llama/Llama-2-7b-hf",
  "tensor_parallel_size": 1,
  "dtype": "bfloat16",
  "seed": 42,
  "gpu_memory_utilization": 0.9,
  "max_model_len": 4096,
  "enable_prefix_caching": true,
  "max_num_seqs": 256,
  "max_num_batched_tokens": 8192
}
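A hedged sketch of how a config dict like `vllm_config` could drive group sampling for GRPO. It assumes the installed vLLM version accepts these keys as `LLM(...)` keyword arguments; the helper names `grpo_sampling_kwargs` and `generate_groups` are illustrative. The vLLM import is deferred so the cell can be defined on a machine without vLLM or a GPU.

```python
from typing import Any, Dict, List


def grpo_sampling_kwargs(
    group_size: int, max_tokens: int, temperature: float = 1.0
) -> Dict[str, Any]:
    """Keyword arguments for vllm.SamplingParams requesting `group_size` samples per prompt."""
    return {"n": group_size, "temperature": temperature, "max_tokens": max_tokens}


def generate_groups(
    prompts: List[str],
    engine_kwargs: Dict[str, Any],
    group_size: int = 8,
    max_tokens: int = 512,
) -> List[List[str]]:
    """Return one list of `group_size` completions per prompt (requires vLLM and a GPU)."""
    # Imported lazily: defining this function does not require vLLM to be installed.
    from vllm import LLM, SamplingParams

    llm = LLM(**engine_kwargs)
    params = SamplingParams(**grpo_sampling_kwargs(group_size, max_tokens))
    outputs = llm.generate(prompts, params)
    # vLLM returns one RequestOutput per prompt, each holding `n` completions.
    return [[completion.text for completion in out.outputs] for out in outputs]


print(grpo_sampling_kwargs(group_size=8, max_tokens=512))
```

On a GPU machine, `generate_groups(["Explain KV caching."], vllm_config)` would return a single group of eight completions. Because all samples in a group share the same prompt, prefix caching lets the engine reuse the prompt's KV blocks across them.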


## 3. DeepSpeed ZeRO: Memory-Efficient Training

**Three stages of memory optimization:**

| Stage | Partitions | Model-State Memory Reduction |
|-------|-----------|------------------------------|
| ZeRO-1 | Optimizer states | Up to 4x |
| ZeRO-2 | + Gradients | Up to 8x |
| ZeRO-3 | + Parameters | Linear with GPUs |
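The 4x and 8x figures are limiting values for large GPU counts. As a worked derivation, let $\Psi$ be the parameter count, $N$ the number of GPUs, and $K$ the optimizer bytes per parameter ($K = 12$ for mixed-precision Adam, matching the accounting used in the next cell). Per-GPU model-state memory in bytes is then:

$$
\begin{aligned}
M_{\text{baseline}} &= (2 + 2 + K)\,\Psi \\
M_{\text{ZeRO-1}} &= 2\Psi + 2\Psi + \frac{K\Psi}{N} \\
M_{\text{ZeRO-2}} &= 2\Psi + \frac{(2 + K)\,\Psi}{N} \\
M_{\text{ZeRO-3}} &= \frac{(2 + 2 + K)\,\Psi}{N}
\end{aligned}
$$

With $K = 12$ the baseline is $16\Psi$. As $N \to \infty$, ZeRO-1 approaches $4\Psi$ (4x) and ZeRO-2 approaches $2\Psi$ (8x). At $N = 8$ the values are $5.5\Psi$ (about 2.9x), $3.75\Psi$ (about 4.3x), and $2\Psi$ (8x).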

In [6]:
def estimate_training_memory(
    param_count: int,
    batch_size: int,
    seq_len: int,
    num_gpus: int = 1,
    zero_stage: int = 0,
) -> Dict[str, float]:
    """
    Estimate per-GPU memory for model states under different ZeRO stages.
    
    Memory components counted:
    - Parameters: 2 bytes/param (FP16)
    - Gradients: 2 bytes/param
    - Optimizer (Adam): 12 bytes/param (FP32 params + momentum + variance)
    
    Activations (roughly batch * seq * hidden * layers) are NOT included,
    so batch_size and seq_len are accepted but currently unused.
    """
    bytes_per_param = 2  # FP16
    
    # Base memory (no ZeRO)
    param_memory = param_count * bytes_per_param
    grad_memory = param_count * bytes_per_param
    optimizer_memory = param_count * 12  # Adam states
    
    total_base = param_memory + grad_memory + optimizer_memory
    
    # Apply ZeRO partitioning
    if zero_stage == 0:
        memory_per_gpu = total_base
    elif zero_stage == 1:
        # Partition optimizer states
        memory_per_gpu = param_memory + grad_memory + (optimizer_memory / num_gpus)
    elif zero_stage == 2:
        # Partition optimizer + gradients
        memory_per_gpu = param_memory + (grad_memory + optimizer_memory) / num_gpus
    elif zero_stage == 3:
        # Partition everything
        memory_per_gpu = (param_memory + grad_memory + optimizer_memory) / num_gpus
    else:
        raise ValueError(f"Invalid ZeRO stage: {zero_stage}")
    
    return {
        'total_memory_gb': total_base / (1024 ** 3),
        'per_gpu_gb': memory_per_gpu / (1024 ** 3),
        'reduction_factor': total_base / memory_per_gpu,
    }

In [7]:
# Compare ZeRO stages for 7B model

param_count = 7_000_000_000  # 7B parameters
num_gpus = 8

print(f"Memory requirements for {param_count/1e9:.0f}B model on {num_gpus} GPUs:")
print(f"{'ZeRO Stage':<12} {'Per GPU (GB)':<15} {'Reduction':<10}")
print("-" * 40)

for stage in [0, 1, 2, 3]:
    mem = estimate_training_memory(param_count, 1, 2048, num_gpus, stage)
    print(f"{'Stage ' + str(stage):<12} {mem['per_gpu_gb']:<15.1f} {mem['reduction_factor']:.1f}x")

Memory requirements for 7B model on 8 GPUs:
ZeRO Stage   Per GPU (GB)    Reduction 
----------------------------------------
Stage 0      104.3           1.0x
Stage 1      35.9            2.9x
Stage 2      24.4            4.3x
Stage 3      13.0            8.0x
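One practical use of these estimates is picking the lowest ZeRO stage that fits a given GPU, since lower stages communicate less. The sketch below repeats the same 2 + 2 + 12 bytes-per-parameter accounting so it is self-contained. The function names and the `reserve_fraction` knob (headroom for activations and buffers) are illustrative assumptions, not DeepSpeed APIs.

```python
from typing import Optional

GIB = 1024 ** 3


def model_state_bytes_per_gpu(param_count: int, num_gpus: int, zero_stage: int) -> float:
    """Per-GPU bytes for FP16 params + FP16 grads + Adam states under a ZeRO stage."""
    if zero_stage not in (0, 1, 2, 3):
        raise ValueError(f"Invalid ZeRO stage: {zero_stage}")
    params = 2.0 * param_count
    grads = 2.0 * param_count
    optim = 12.0 * param_count
    if zero_stage >= 1:
        optim /= num_gpus   # ZeRO-1 partitions optimizer states
    if zero_stage >= 2:
        grads /= num_gpus   # ZeRO-2 also partitions gradients
    if zero_stage >= 3:
        params /= num_gpus  # ZeRO-3 also partitions parameters
    return params + grads + optim


def min_zero_stage(
    param_count: int, num_gpus: int, gpu_mem_gib: float, reserve_fraction: float = 0.0
) -> Optional[int]:
    """Lowest stage whose model states fit in the budget, or None if even ZeRO-3 does not."""
    budget = gpu_mem_gib * GIB * (1.0 - reserve_fraction)
    for stage in (0, 1, 2, 3):
        if model_state_bytes_per_gpu(param_count, num_gpus, stage) <= budget:
            return stage
    return None


for mem_gib in (80, 24, 8):
    stage = min_zero_stage(7_000_000_000, num_gpus=8, gpu_mem_gib=mem_gib)
    print(f"{mem_gib} GiB GPUs -> minimum ZeRO stage: {stage}")
```

A result of `None` means the model states alone exceed the budget, which is the situation where CPU offloading or more GPUs become necessary.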


### DeepSpeed Configuration Files

In [8]:
# DeepSpeed ZeRO-3 configuration for RLHF

deepspeed_config_zero3 = {
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    
    "bf16": {
        "enabled": True
    },
    
    "zero_optimization": {
        "stage": 3,
        "overlap_comm": True,
        "contiguous_gradients": True,
        "reduce_bucket_size": "auto",
        "stage3_prefetch_bucket_size": "auto",
        "stage3_param_persistence_threshold": "auto",
        
        # CPU offloading (for memory-constrained setups)
        "offload_param": {
            "device": "none",  # Set to "cpu" to enable
            "pin_memory": True
        },
        "offload_optimizer": {
            "device": "none",  # Set to "cpu" to enable
            "pin_memory": True
        },
    },
    
    "gradient_clipping": 1.0,
}

print("DeepSpeed ZeRO-3 Config:")
print(json.dumps(deepspeed_config_zero3, indent=2))

DeepSpeed ZeRO-3 Config:
{
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "bf16": {
    "enabled": true
  },
  "zero_optimization": {
    "stage": 3,
    "overlap_comm": true,
    "contiguous_gradients": true,
    "reduce_bucket_size": "auto",
    "stage3_prefetch_bucket_size": "auto",
    "stage3_param_persistence_threshold": "auto",
    "offload_param": {
      "device": "none",
      "pin_memory": true
    },
    "offload_optimizer": {
      "device": "none",
      "pin_memory": true
    }
  },
  "gradient_clipping": 1.0
}
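The `"device": "none"` entries above are the switch for CPU offloading. As a small sketch of flipping them programmatically without mutating the original dict (the helper `with_cpu_offload` and the trimmed `zero3_base` config are illustrative; the stage-3 check reflects that parameter offload is a ZeRO-3 feature):

```python
import copy
import json
from typing import Any, Dict


def with_cpu_offload(
    ds_config: Dict[str, Any],
    offload_params: bool = True,
    offload_optimizer: bool = True,
) -> Dict[str, Any]:
    """Return a copy of a DeepSpeed config with CPU offloading switched on."""
    cfg = copy.deepcopy(ds_config)
    zero = cfg.setdefault("zero_optimization", {})
    if offload_optimizer:
        zero.setdefault("offload_optimizer", {"pin_memory": True})["device"] = "cpu"
    if offload_params:
        if zero.get("stage") != 3:
            raise ValueError("Parameter offload requires ZeRO stage 3")
        zero.setdefault("offload_param", {"pin_memory": True})["device"] = "cpu"
    return cfg


# Trimmed stand-in for the ZeRO-3 config shown above
zero3_base = {
    "zero_optimization": {
        "stage": 3,
        "offload_param": {"device": "none", "pin_memory": True},
        "offload_optimizer": {"device": "none", "pin_memory": True},
    }
}

offloaded = with_cpu_offload(zero3_base)
print(json.dumps(offloaded["zero_optimization"], indent=2))
```

Offloading trades GPU memory for PCIe traffic, so it usually slows each step; it is a fallback for when the model states do not fit otherwise.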


## 4. Ray: Distributed Orchestration

Ray coordinates the four models of PPO-style RLHF across GPUs (GRPO drops the critic):

```
┌─────────────────────────────────────────────────────────────┐
│                     Ray Placement Groups                     │
├─────────────┬─────────────┬─────────────┬───────────────────┤
│   Actor     │   Critic    │   Reward    │    Reference      │
│  (Train)    │   (Train)   │  (Inference)│   (Inference)     │
│   GPU 0-3   │   GPU 4-5   │   GPU 6     │     GPU 7         │
└─────────────┴─────────────┴─────────────┴───────────────────┘
```
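In Ray, a layout like the one in the diagram is expressed as a placement group: a list of resource bundles that Ray reserves together. The sketch below builds one single-GPU bundle per model GPU; `build_bundles` and `reserve_gpus` are illustrative helpers, and the one-CPU-per-GPU ratio is an assumption. Ray is imported lazily so the cell can be defined without a cluster.

```python
from typing import Dict, List


def build_bundles(gpus_per_model: Dict[str, int], cpus_per_gpu: int = 1) -> List[Dict[str, float]]:
    """One {'GPU': 1, 'CPU': n} bundle per GPU, in model order (actor first, and so on)."""
    bundles: List[Dict[str, float]] = []
    for _model, num_gpus in gpus_per_model.items():
        bundles.extend({"GPU": 1, "CPU": cpus_per_gpu} for _ in range(num_gpus))
    return bundles


def reserve_gpus(bundles: List[Dict[str, float]], strategy: str = "PACK"):
    """Reserve the bundles on a running Ray cluster and block until they are ready."""
    # Imported lazily: requires `ray` and enough GPUs in the cluster.
    import ray
    from ray.util.placement_group import placement_group

    if not ray.is_initialized():
        ray.init()
    pg = placement_group(bundles, strategy=strategy)
    ray.get(pg.ready())  # blocks until every bundle has been scheduled
    return pg


layout = {"actor": 4, "critic": 2, "reward": 1, "reference": 1}
bundles = build_bundles(layout)
print(f"{len(bundles)} bundles, first: {bundles[0]}")
```

The `PACK` strategy asks Ray to place bundles on as few nodes as possible, which keeps weight synchronization between models on fast intra-node links where it can.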

In [9]:
# Ray concepts demonstration (simulated without actual Ray)

class RayActorSimulator:
    """
    Simulates Ray actor behavior for understanding.
    In production, use @ray.remote decorator.
    """
    
    def __init__(self, name: str, num_gpus: int):
        self.name = name
        self.num_gpus = num_gpus
        self.state = {}
    
    def get_resources(self) -> Dict:
        return {"name": self.name, "num_gpus": self.num_gpus}
    
    def update_weights(self, weights: Dict) -> None:
        """Receive updated weights from training."""
        self.state['weights_version'] = weights.get('version', 0)
    
    def generate(self, prompts: List[str]) -> List[str]:
        """Generate responses (placeholder)."""
        return [f"Response to: {p[:20]}..." for p in prompts]


class OpenRLHFSimulator:
    """
    Simulates OpenRLHF's distributed architecture.
    """
    
    def __init__(self, config: Dict):
        self.config = config
        
        # Create actors for each model
        self.actor = RayActorSimulator("actor", config.get('actor_gpus', 4))
        self.critic = RayActorSimulator("critic", config.get('critic_gpus', 2))
        self.reward = RayActorSimulator("reward", config.get('reward_gpus', 1))
        self.reference = RayActorSimulator("reference", config.get('ref_gpus', 1))
    
    def get_resource_allocation(self) -> Dict:
        """Show resource allocation."""
        return {
            'actor': self.actor.get_resources(),
            'critic': self.critic.get_resources(),
            'reward': self.reward.get_resources(),
            'reference': self.reference.get_resources(),
            'total_gpus': sum([
                self.actor.num_gpus,
                self.critic.num_gpus,
                self.reward.num_gpus,
                self.reference.num_gpus,
            ])
        }

In [10]:
# Example resource allocation

simulator = OpenRLHFSimulator({
    'actor_gpus': 4,
    'critic_gpus': 2,
    'reward_gpus': 1,
    'ref_gpus': 1,
})

allocation = simulator.get_resource_allocation()
print("OpenRLHF Resource Allocation:")
for model, resources in allocation.items():
    if model != 'total_gpus':
        print(f"  {model}: {resources['num_gpus']} GPUs")
print(f"  Total: {allocation['total_gpus']} GPUs")

OpenRLHF Resource Allocation:
  actor: 4 GPUs
  critic: 2 GPUs
  reward: 1 GPUs
  reference: 1 GPUs
  Total: 8 GPUs


## 5. OpenRLHF: Complete Training Framework

### Command-Line Interface Examples

In [11]:
# OpenRLHF CLI command generator

def generate_openrlhf_command(
    algorithm: str = 'grpo',
    pretrain: str = 'meta-llama/Llama-2-7b-hf',
    reward_model: str = 'OpenRLHF/reward-model',
    dataset: str = 'OpenRLHF/prompt-collection',
    num_gpus: int = 8,
    use_vllm: bool = True,
    colocate: bool = False,
) -> str:
    """
    Generate OpenRLHF training command.
    """
    cmd = [
        "python", "-m", "openrlhf.cli.train_ppo_ray",
        f"--pretrain {pretrain}",
        f"--reward_pretrain {reward_model}",
        f"--prompt_data {dataset}",
        
        # Algorithm selection
        f"--advantage_estimator {'group_norm' if algorithm == 'grpo' else 'gae'}",
        "--use_kl_loss" if algorithm == 'grpo' else "",
        
        # Training params
        "--actor_learning_rate 1e-5",
        "--critic_learning_rate 5e-6",
        "--init_kl_coef 0.1",
        "--eps_clip 0.2",
        "--max_epochs 4",
        "--train_batch_size 128",
        "--rollout_batch_size 32",
        
        # vLLM settings
        f"--vllm_num_engines {2 if use_vllm else 0}",
        "--vllm_gpu_memory_utilization 0.9" if use_vllm else "",
        
        # Colocation
        "--colocate_all_models" if colocate else "",
        
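        # Resource allocation
        # NOTE: only the actor count scales with num_gpus; the critic, reward,
        # and reference counts are fixed, so check the total against your cluster.
        f"--actor_num_gpus_per_node {num_gpus // 2}",
        "--critic_num_gpus_per_node 2",
        "--reward_num_gpus_per_node 1",
        "--ref_num_gpus_per_node 1",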
    ]
    
    # Filter empty strings and join
    return " \\\n  ".join([c for c in cmd if c])

In [12]:
# Generate GRPO training command

grpo_cmd = generate_openrlhf_command(
    algorithm='grpo',
    num_gpus=8,
    use_vllm=True,
)

print("OpenRLHF GRPO Training Command:")
print("=" * 60)
print(grpo_cmd)

OpenRLHF GRPO Training Command:
python \
  -m \
  openrlhf.cli.train_ppo_ray \
  --pretrain meta-llama/Llama-2-7b-hf \
  --reward_pretrain OpenRLHF/reward-model \
  --prompt_data OpenRLHF/prompt-collection \
  --advantage_estimator group_norm \
  --use_kl_loss \
  --actor_learning_rate 1e-5 \
  --critic_learning_rate 5e-6 \
  --init_kl_coef 0.1 \
  --eps_clip 0.2 \
  --max_epochs 4 \
  --train_batch_size 128 \
  --rollout_batch_size 32 \
  --vllm_num_engines 2 \
  --vllm_gpu_memory_utilization 0.9 \
  --actor_num_gpus_per_node 4 \
  --critic_num_gpus_per_node 2 \
  --reward_num_gpus_per_node 1 \
  --ref_num_gpus_per_node 1
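For intuition about what a group-normalized advantage estimator computes, here is a minimal sketch in the GRPO sense: each response's reward is standardized against the other responses to the same prompt, so no critic is needed. Whether OpenRLHF's `group_norm` matches these details (population standard deviation, the `eps` value) is an assumption; the function name is illustrative.

```python
import statistics
from typing import List


def group_normalized_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Standardize rewards within one prompt's group: (r - mean) / (std + eps)."""
    if not rewards:
        raise ValueError("rewards must be non-empty")
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std over the group (an assumption)
    return [(r - mean) / (std + eps) for r in rewards]


# One prompt, four sampled responses scored by a reward function
advantages = group_normalized_advantages([1.0, 0.0, 0.0, 1.0])
print([round(a, 3) for a in advantages])

# A group with identical rewards carries no learning signal
print(group_normalized_advantages([0.5, 0.5, 0.5, 0.5]))
```

Responses above the group mean get positive advantages and those below get negative ones, which is why the group size (responses per prompt) matters: a larger group gives a less noisy baseline.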


In [13]:
# Generate memory-efficient command (colocate + smaller GPU count)

efficient_cmd = generate_openrlhf_command(
    algorithm='grpo',
    num_gpus=4,
    use_vllm=True,
    colocate=True,
)

print("Memory-Efficient Configuration (4 GPUs, colocated):")
print("=" * 60)
print(efficient_cmd)

Memory-Efficient Configuration (4 GPUs, colocated):
python \
  -m \
  openrlhf.cli.train_ppo_ray \
  --pretrain meta-llama/Llama-2-7b-hf \
  --reward_pretrain OpenRLHF/reward-model \
  --prompt_data OpenRLHF/prompt-collection \
  --advantage_estimator group_norm \
  --use_kl_loss \
  --actor_learning_rate 1e-5 \
  --critic_learning_rate 5e-6 \
  --init_kl_coef 0.1 \
  --eps_clip 0.2 \
  --max_epochs 4 \
  --train_batch_size 128 \
  --rollout_batch_size 32 \
  --vllm_num_engines 2 \
  --vllm_gpu_memory_utilization 0.9 \
  --colocate_all_models \
  --actor_num_gpus_per_node 2 \
  --critic_num_gpus_per_node 2 \
  --reward_num_gpus_per_node 1 \
  --ref_num_gpus_per_node 1


## 6. TRL: HuggingFace Trainer Interface

For simpler setups, TRL provides higher-level trainers.

In [14]:
# TRL GRPOTrainer configuration template

trl_grpo_config = {
    # Model
    "model_name": "meta-llama/Llama-2-7b-hf",
    
    # GRPO specific
    "num_generations": 8,  # Group size (K)
    "beta": 0.1,  # KL coefficient
    
    # Training
    "learning_rate": 1e-5,
    "per_device_train_batch_size": 8,  # Effective batch size must be divisible by num_generations
    "gradient_accumulation_steps": 4,
    "num_train_epochs": 3,
    
    # Generation
    "max_completion_length": 512,
    "temperature": 1.0,
    
    # vLLM acceleration
    "use_vllm": True,
    "vllm_mode": "colocate",
    "vllm_gpu_memory_utilization": 0.9,
    
    # Optimization
    "bf16": True,
    "gradient_checkpointing": True,
}

print("TRL GRPOTrainer Configuration:")
print(json.dumps(trl_grpo_config, indent=2))

TRL GRPOTrainer Configuration:
{
  "model_name": "meta-llama/Llama-2-7b-hf",
  "num_generations": 8,
  "beta": 0.1,
  "learning_rate": 1e-05,
  "per_device_train_batch_size": 8,
  "gradient_accumulation_steps": 4,
  "num_train_epochs": 3,
  "max_completion_length": 512,
  "temperature": 1.0,
  "use_vllm": true,
  "vllm_mode": "colocate",
  "vllm_gpu_memory_utilization": 0.9,
  "bf16": true,
  "gradient_checkpointing": true
}


In [15]:
# TRL usage example (printed, not executed: requires trl, datasets, a GPU, and model access)

trl_example_code = '''
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

# Prompt dataset with a "prompt" column (example from the TRL docs; swap in your own)
dataset = load_dataset("trl-lib/tldr", split="train")

# Define reward function: receives the completions (plus extra dataset
# columns as keyword arguments) and returns one score per completion
def reward_fn(completions: list[str], **kwargs) -> list[float]:
    """Your reward function here."""
    return [len(c) / 100 for c in completions]  # Example: reward length

# Configure trainer
config = GRPOConfig(
    output_dir="grpo-llama2-7b",
    num_generations=8,
    beta=0.1,
    learning_rate=1e-5,
    per_device_train_batch_size=8,
    use_vllm=True,
)

# Create trainer
trainer = GRPOTrainer(
    model=model,
    reward_funcs=[reward_fn],
    args=config,
    train_dataset=dataset,
    processing_class=tokenizer,
)

# Train
trainer.train()
'''

print("TRL GRPOTrainer Usage:")
print(trl_example_code)

TRL GRPOTrainer Usage:

from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

# Prompt dataset with a "prompt" column (example from the TRL docs; swap in your own)
dataset = load_dataset("trl-lib/tldr", split="train")

# Define reward function: receives the completions (plus extra dataset
# columns as keyword arguments) and returns one score per completion
def reward_fn(completions: list[str], **kwargs) -> list[float]:
    """Your reward function here."""
    return [len(c) / 100 for c in completions]  # Example: reward length

# Configure trainer
config = GRPOConfig(
    output_dir="grpo-llama2-7b",
    num_generations=8,
    beta=0.1,
    learning_rate=1e-5,
    per_device_train_batch_size=8,
    use_vllm=True,
)

# Create trainer
trainer = GRPOTrainer(
    model=model,
    reward_funcs=[reward_fn],
    args=config,
    train_dataset=dataset,
    processing_class=tokenizer,
)

# Train
trainer.train()



## 7. Async Training: Decoupling Generation and Training

**Problem:** Synchronous training wastes GPU time.

```
Synchronous:
GPU 0: [GENERATE.....][IDLE][TRAIN][IDLE][GENERATE.....]
GPU 1: [IDLE][GENERATE.....][IDLE][TRAIN][IDLE]

Asynchronous:
GPU 0 (Gen):   [GENERATE][GENERATE][GENERATE][GENERATE]
GPU 1 (Train): [TRAIN][TRAIN][TRAIN][TRAIN][TRAIN]
```

**Key insight:** Online DPO is comparatively robust to off-policy data, so generation can run a few policy versions ahead of training, enabling reported speedups of 25-40%.
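The decoupling in the diagram is a producer/consumer pattern. As a minimal sketch using only the standard library (threads stand in for GPU workers, and `run_async_pipeline` is an illustrative name), a bounded queue lets generation run ahead of training while capping how stale a batch can get:

```python
import queue
import threading
from typing import List


def run_async_pipeline(num_batches: int, max_queue: int) -> List[int]:
    """Run a generator thread and a trainer thread; return the staleness of each trained batch."""
    batches: queue.Queue = queue.Queue(maxsize=max_queue)
    state = {"policy_version": 0}  # number of training updates applied so far
    staleness_log: List[int] = []

    def generator() -> None:
        for i in range(num_batches):
            # Tag the batch with the weights that produced it.
            batch = {"id": i, "policy_version": state["policy_version"]}
            batches.put(batch)  # blocks while the queue is full
        batches.put(None)  # sentinel: no more data

    def trainer() -> None:
        while True:
            batch = batches.get()
            if batch is None:
                break
            # Staleness = updates applied since this batch was generated.
            staleness_log.append(state["policy_version"] - batch["policy_version"])
            state["policy_version"] += 1  # one gradient update per batch

    threads = [threading.Thread(target=generator), threading.Thread(target=trainer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return staleness_log


log = run_async_pipeline(num_batches=50, max_queue=2)
print(f"Trained on {len(log)} batches, max staleness {max(log)}")
```

The queue size acts as the staleness budget: in this sketch a batch can be at most `max_queue + 1` updates old, because the generator blocks once the queue is full. The exact staleness values vary between runs since they depend on thread scheduling.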

In [16]:
class AsyncTrainingSimulator:
    """
    Demonstrates async training concept.
    
    In production, OpenRLHF uses --async_train flag.
    """
    
    def __init__(self, staleness_tolerance: int = 2):
        self.staleness_tolerance = staleness_tolerance
        self.generation_version = 0
        self.training_version = 0
        self.buffer = []
    
    def generate_batch(self) -> Dict:
        """Simulate generation with the current policy weights."""
        self.generation_version += 1
        batch = {
            'data': f'batch_{self.generation_version}',
            # Tag the batch with the weights that produced it, i.e. the
            # number of training updates applied so far.
            'policy_version': self.training_version,
        }
        self.buffer.append(batch)
        return batch
    
    def train_step(self) -> Dict:
        """Train on buffered data (may be off-policy)."""
        if not self.buffer:
            return {'status': 'waiting'}
        
        batch = self.buffer.pop(0)
        
        # Staleness = updates applied since this batch was generated
        current_version = self.training_version
        staleness = current_version - batch['policy_version']
        self.training_version += 1
        
        return {
            'batch': batch['data'],
            'training_version': current_version,
            'batch_version': batch['policy_version'],
            'staleness': staleness,
            # True while the batch is within the staleness tolerance
            'is_on_policy': staleness <= self.staleness_tolerance,
        }
    
    def simulate(self, steps: int) -> List[Dict]:
        """Simulate async training."""
        results = []
        
        for _ in range(steps):
            # Generation runs continuously
            self.generate_batch()
            self.generate_batch()  # 2x faster than training
            
            # Training runs on available data
            result = self.train_step()
            results.append(result)
        
        return results

In [17]:
# Simulate async training

simulator = AsyncTrainingSimulator(staleness_tolerance=2)
results = simulator.simulate(5)

print("Async Training Simulation:")
print(f"{'Step':<6} {'Train Ver':<10} {'Batch Ver':<10} {'Stale':<8} {'On-Policy':<10}")
print("-" * 50)

for i, r in enumerate(results):
    if 'staleness' in r:
        print(f"{i:<6} {r['training_version']:<10} {r['batch_version']:<10} "
              f"{r['staleness']:<8} {'✓' if r['is_on_policy'] else '✗':<10}")

Async Training Simulation:
Step   Train Ver  Batch Ver  Stale    On-Policy 
--------------------------------------------------
0      0          0          0        ✓         
1      1          0          1        ✓         
2      2          1          1        ✓         
3      3          1          2        ✓         
4      4          2          2        ✓         


## 8. Summary: Production Stack Checklist

| Component | Small Scale (1-4 GPUs) | Large Scale (8+ GPUs) |
|-----------|------------------------|----------------------|
| Framework | TRL `GRPOTrainer` | OpenRLHF |
| Generation | vLLM colocated | vLLM separate engines |
| Memory | DeepSpeed ZeRO-2 | DeepSpeed ZeRO-3 |
| Distributed | Accelerate | Ray |
| Training | Synchronous | Async (--async_train) |

### Quick Start Commands

```bash
# Install
pip install openrlhf vllm deepspeed

# Single-node GRPO (4 GPUs)
python -m openrlhf.cli.train_ppo_ray \
  --pretrain meta-llama/Llama-2-7b-hf \
  --advantage_estimator group_norm \
  --use_kl_loss \
  --vllm_num_engines 1 \
  --colocate_all_models

# Multi-node (Ray cluster)
ray start --head  # On head node
ray start --address=<head-ip>:6379  # On worker nodes (6379 is Ray's default port)
# Then run training command
```

---
**Next:** `07_dpo_alternatives.ipynb` (DPO, ORPO, SimPO)