1 change: 1 addition & 0 deletions examples/rl/.python-version
@@ -0,0 +1 @@
3.12
249 changes: 249 additions & 0 deletions examples/rl/README.md
@@ -0,0 +1,249 @@
# Reinforcement Learning Example (Python)

## Motivation

Reinforcement Learning (RL) training is computationally intensive, with episode collection (rollouts) being a major bottleneck. Each episode requires running the environment simulation, which can be slow for complex environments. Fortunately, episode collection is **embarrassingly parallel** — each episode is independent and can run on a separate worker.

By leveraging the `flamepy.Runner` API, we can distribute episode collection across multiple executors in a Flame cluster, dramatically speeding up training while keeping the policy update logic centralized. This pattern is common in distributed RL systems like IMPALA, Ape-X, and SEED.

This example illustrates:
- How to parallelize RL episode collection using Flame Runner
- The actor-learner pattern: distributed actors collect experience, centralized learner updates policy
- How to serialize and broadcast PyTorch model weights to remote workers (see the sketch after this list)
- Clean separation between distributed data collection and local gradient computation
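
In practice, the weight broadcast boils down to serializing the PyTorch `state_dict` to bytes on the learner and restoring it inside each worker. A minimal sketch, using helper names of our own choosing (the actual code in `main.py` may organize this differently):

```python
import io

import torch


def serialize_weights(model: torch.nn.Module) -> bytes:
    """Learner side: pack the current state_dict into a bytes payload for broadcast."""
    buffer = io.BytesIO()
    torch.save(model.state_dict(), buffer)
    return buffer.getvalue()


def load_weights(model: torch.nn.Module, payload: bytes) -> None:
    """Worker side: restore a state_dict received from the learner."""
    state_dict = torch.load(io.BytesIO(payload), map_location="cpu")
    model.load_state_dict(state_dict)
```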

## Overview

This example implements the REINFORCE (policy gradient) algorithm on environments from Gymnasium, with distributed episode collection using Flame.
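
As a brief refresher, REINFORCE estimates the policy gradient from sampled episodes using discounted rewards-to-go:

$$
\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_{i=1}^{N}\sum_{t=0}^{T_i-1} \nabla_\theta \log \pi_\theta(a_{i,t}\mid s_{i,t})\, G_{i,t},
\qquad
G_{i,t} = \sum_{k=t}^{T_i-1} \gamma^{\,k-t}\, r_{i,k}
$$

Because the $N$ episodes are sampled independently, they can be collected on separate executors; only the gradient step needs the full batch.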

### Supported Environments

| Environment | Type | Observation Dim | Action Dim | Episode Time |
|-------------|------|-----------------|------------|--------------|
| `cartpole` | Discrete | 4 | 2 | ~1ms |
| `halfcheetah` | Continuous (MuJoCo) | 17 | 6 | ~20ms |
| `hopper` | Continuous (MuJoCo) | 11 | 3 | ~15ms |
| `walker2d` | Continuous (MuJoCo) | 17 | 6 | ~20ms |
| `ant` | Continuous (MuJoCo) | 105 | 8 | ~50ms |
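
The short names above map to standard Gymnasium environment IDs. Below is a minimal sketch of how the environment configs in `model.py` might be organized; the `Ant-v5` ID is confirmed by the example output later in this README, while the other IDs and this exact dictionary layout are assumptions:

```python
import gymnasium as gym

# Hypothetical registry; the real configs live in model.py and may differ.
ENV_CONFIGS = {
    "cartpole":    {"id": "CartPole-v1",    "continuous": False},
    "hopper":      {"id": "Hopper-v5",      "continuous": True},
    "halfcheetah": {"id": "HalfCheetah-v5", "continuous": True},
    "walker2d":    {"id": "Walker2d-v5",    "continuous": True},
    "ant":         {"id": "Ant-v5",         "continuous": True},
}


def make_env(name: str) -> gym.Env:
    """Create the Gymnasium environment behind a short name like 'ant'."""
    return gym.make(ENV_CONFIGS[name]["id"])
```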

### How It Works

1. **Policy Networks**: Neural networks that output action distributions:
- `DiscretePolicy`: For CartPole (categorical distribution)
- `ContinuousPolicy`: For MuJoCo environments (Gaussian distribution with learned std)

2. **Distributed Episode Collection**: Using `flamepy.Runner`, we create a service from the `collect_episode` function. Each call to this service runs on a remote executor that:
- Creates its own Gymnasium environment instance
- Loads the current policy weights (serialized and sent from the learner)
- Runs one complete episode
- Returns the collected experience (states, actions, rewards)

3. **Centralized Policy Update**: After collecting experiences from all parallel episodes, the learner:
- Computes discounted rewards
- Calculates policy gradients
- Updates the policy network locally

4. **Iteration**: The process repeats: broadcast new weights, collect more episodes, update again (see the sketch below).
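
To make the loop concrete, here is a compact sketch of the actor side and the learner update for the discrete (CartPole-style) case. The `run_episode` and `reinforce_update` names, the network shape, and the hyperparameters are illustrative rather than the exact code in `main.py`, and the dispatch through `flamepy.Runner` (creating the service, invoking it once per episode, gathering futures) is intentionally omitted, since it follows the Flame API rather than the RL logic:

```python
import io

import gymnasium as gym
import numpy as np
import torch
from torch import nn
from torch.distributions import Categorical

GAMMA = 0.99  # discount factor (illustrative value)


def run_episode(env_id: str, weight_bytes: bytes, hidden: int = 64):
    """Actor side: rebuild the policy from broadcast weights, roll out one episode.

    In the distributed version a function like this runs on each remote executor;
    here it is plain single-process Python.
    """
    env = gym.make(env_id)
    obs_dim = env.observation_space.shape[0]
    n_actions = env.action_space.n
    policy = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh(), nn.Linear(hidden, n_actions))
    policy.load_state_dict(torch.load(io.BytesIO(weight_bytes), map_location="cpu"))

    states, actions, rewards = [], [], []
    obs, _ = env.reset()
    done = False
    while not done:
        with torch.no_grad():
            dist = Categorical(logits=policy(torch.as_tensor(obs, dtype=torch.float32)))
        action = int(dist.sample().item())
        next_obs, reward, terminated, truncated, _ = env.step(action)
        states.append(obs)
        actions.append(action)
        rewards.append(float(reward))
        obs, done = next_obs, terminated or truncated
    env.close()
    return states, actions, rewards


def reinforce_update(policy, optimizer, episodes, gamma: float = GAMMA) -> float:
    """Learner side: discounted rewards-to-go plus one policy-gradient step."""
    losses = []
    for states, actions, rewards in episodes:
        returns, g = [], 0.0
        for r in reversed(rewards):          # discounted reward-to-go
            g = r + gamma * g
            returns.append(g)
        returns.reverse()
        returns = torch.tensor(returns, dtype=torch.float32)
        returns = (returns - returns.mean()) / (returns.std() + 1e-8)  # normalize (variance reduction)
        obs = torch.as_tensor(np.asarray(states), dtype=torch.float32)
        log_probs = Categorical(logits=policy(obs)).log_prob(torch.as_tensor(actions))
        losses.append(-(log_probs * returns).sum())
    loss = torch.stack(losses).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In the distributed path, the learner serializes the current weights once per iteration, submits one `collect_episode`-style call per episode to the Runner service, waits for all results, and then runs the update locally.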

### Files

- **`main.py`**: REINFORCE training (distributed by default, use `--local` for local mode)
- **`model.py`**: Shared components (policy networks, environment configs); a rough sketch of the policy classes follows below
- **`pyproject.toml`**: Package dependencies including `torch`, `gymnasium[mujoco]`, and `flamepy`
- **`README.md`**: This documentation file
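
For orientation, the two policy classes described above probably look roughly like this; the class names come from this README, but the layer sizes and the state-independent learned `log_std` are assumptions:

```python
import torch
from torch import nn
from torch.distributions import Categorical, Normal


class DiscretePolicy(nn.Module):
    """Categorical policy for discrete action spaces such as CartPole."""

    def __init__(self, obs_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, obs: torch.Tensor) -> Categorical:
        return Categorical(logits=self.net(obs))


class ContinuousPolicy(nn.Module):
    """Gaussian policy with a learned, state-independent std for MuJoCo tasks."""

    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 64):
        super().__init__()
        self.mean_net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, act_dim),
        )
        self.log_std = nn.Parameter(torch.zeros(act_dim))

    def forward(self, obs: torch.Tensor) -> Normal:
        return Normal(self.mean_net(obs), self.log_std.exp())
```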

### Key Benefits

- **Linear Speedup**: Collecting N episodes in parallel takes roughly the same wall-clock time as collecting one, provided enough executors are available
- **Minimal Code Changes**: The episode collection function is almost identical to single-threaded code
- **Scalability**: Works with any number of executors; increase `episodes_per_iteration` to expose more parallel work
- **Flexibility**: Includes local training mode for development and testing without a cluster

## Usage

### Prerequisites

Start the Flame cluster with Docker Compose:

```shell
$ docker compose up -d
```

### Running Distributed Training

Log into the flame-console and run the example:

```shell
$ docker compose exec -it flame-console /bin/bash
root@container:/# cd /opt/examples/rl
root@container:/opt/examples/rl# uv run main.py
```

### Command Line Options

```shell
# Distributed training with CartPole (default)
uv run main.py

# Distributed training with MuJoCo environments
uv run main.py --env ant
uv run main.py --env halfcheetah
uv run main.py --env hopper
uv run main.py --env walker2d

# Local training (no Flame cluster required)
uv run main.py --local
uv run main.py --env ant --local

# Custom training configuration
uv run main.py --env ant --iterations 50 --episodes-per-iter 50

# Show training plot (requires matplotlib)
uv run main.py --plot
```

### Options

| Flag | Description | Default |
|------|-------------|---------|
| `--env` | Environment: cartpole, halfcheetah, hopper, walker2d, ant | cartpole |
| `--local` | Run local training (no Flame cluster) | Off |
| `--iterations` | Number of training iterations | 100 |
| `--episodes-per-iter` | Parallel episodes per iteration | 100 |
| `--plot` | Show reward plot after training | Off |

## Example Output

### Distributed Training (MuJoCo Ant)

```shell
root@container:/opt/examples/rl# uv run main.py --env ant --iterations 20
============================================================
Distributed REINFORCE on Ant-v5 using Flame Runner
============================================================

Configuration:
Environment: Ant-v5
Observation dim: 105
Action dim: 8
Continuous actions: True
Training iterations: 20
Episodes per iteration: 100
Total episodes: 2000

Starting distributed training...
Iteration 0 | Mean Reward: -431.5 | Loss: 0.7285
Iteration 10 | Mean Reward: -138.8 | Loss: 2.4785
Iteration 19 | Mean Reward: -122.4 | Loss: -7.4812

============================================================
Training Complete!
Total time: 85.23s
Episodes: 2000 (23.5 episodes/sec)
Final Mean Reward: -122.4
============================================================
```

### Local Training

```shell
root@container:/opt/examples/rl# uv run main.py --env ant --iterations 20 --local
============================================================
Local REINFORCE on Ant-v5
============================================================

Configuration:
Environment: Ant-v5
Observation dim: 105
Action dim: 8
Continuous actions: True
Training iterations: 20
Episodes per iteration: 100
Total episodes: 2000

Starting local training...
Iteration 0 | Mean Reward: -161.2 | Loss: -7.8887
Iteration 10 | Mean Reward: -120.5 | Loss: -2.9774
Iteration 19 | Mean Reward: -91.6 | Loss: 0.6673

============================================================
Training Complete!
Total time: 106.45s
Episodes: 2000 (18.8 episodes/sec)
Final Mean Reward: -91.6
============================================================
```

## Architecture

```
┌─────────────────────────────────────────────────────────┐
│                     Learner (Local)                      │
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────┐  │
│  │   Policy    │───▶│  Broadcast  │───▶│   Collect   │  │
│  │   Update    │    │   Weights   │    │   Futures   │  │
│  └─────────────┘    └─────────────┘    └─────────────┘  │
│         ▲                                    │          │
│         │                  ┌─────────────────┘          │
│         │                  ▼                            │
│  ┌─────────────┐    ┌─────────────┐                     │
│  │   Compute   │◀───│  Aggregate  │                     │
│  │  Gradients  │    │  Episodes   │                     │
│  └─────────────┘    └─────────────┘                     │
└─────────────────────────────────────────────────────────┘
                             │  Flame Runner API
┌─────────────────────────────────────────────────────────┐
│                      Flame Cluster                       │
│  ┌───────────┐    ┌───────────┐    ┌───────────┐        │
│  │ Executor 1│    │ Executor 2│    │ Executor N│  ...   │
│  │   ┌───┐   │    │   ┌───┐   │    │   ┌───┐   │        │
│  │   │Env│   │    │   │Env│   │    │   │Env│   │        │
│  │   └───┘   │    │   └───┘   │    │   └───┘   │        │
│  │  Episode  │    │  Episode  │    │  Episode  │        │
│  │ Collection│    │ Collection│    │ Collection│        │
│  └───────────┘    └───────────┘    └───────────┘        │
└─────────────────────────────────────────────────────────┘
```

## Performance

### When Distribution Helps

Distribution overhead is ~100ms per task. Speedup depends on episode duration:

| Environment | Episode Time | Distributed Benefit |
|-------------|--------------|---------------------|
| CartPole | ~1ms | ❌ Overhead dominates |
| Hopper | ~15ms | ⚠️ Marginal with high parallelism |
| HalfCheetah | ~20ms | ⚠️ Marginal with high parallelism |
| Ant | ~50ms | ✅ Benefits with 50+ episodes/iter |
| Complex sims | >100ms | ✅✅ Near-linear speedup |
| Real-world/expensive | >1s | ✅✅✅ Essential |
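
A rough back-of-the-envelope model explains the table. Assume each remote task pays the ~100ms overhead and episodes are spread evenly across executors; then the speedup over serial collection is roughly `episodes * t_episode / (ceil(episodes / executors) * (t_episode + overhead))`. The snippet below only evaluates this simplified estimate; it is not a measurement:

```python
import math


def estimated_speedup(episodes: int, executors: int, episode_ms: float,
                      overhead_ms: float = 100.0) -> float:
    """Crude model: serial collection time divided by parallel time with per-task overhead."""
    serial = episodes * episode_ms
    parallel = math.ceil(episodes / executors) * (episode_ms + overhead_ms)
    return serial / parallel


print(estimated_speedup(100, 10, 1))    # CartPole: ~0.1x, overhead dominates
print(estimated_speedup(100, 10, 50))   # Ant: ~3.3x with 10 executors
print(estimated_speedup(100, 10, 500))  # Heavy sim: ~8.3x, approaching the 10x ideal
```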

### Maximizing Distributed Performance

1. **Increase `--episodes-per-iter`**: More parallel episodes amortize the per-iteration overhead (weight upload, session management)
2. **Use heavier environments**: MuJoCo environments benefit more than CartPole
3. **Scale executors**: More executors = more parallel episode collection

### Scaling Behavior

With N executors collecting episodes in parallel:

| Executors | Episodes/Iteration | Theoretical Speedup | Actual Speedup* |
|-----------|-------------------|---------------------|-----------------|
| 1 | 100 | 1x | 1x |
| 5 | 100 | 5x | ~4x |
| 10 | 100 | 10x | ~7-8x |
| 20 | 100 | 20x | ~12-15x |

*Actual speedup limited by: network latency, executor startup, gradient aggregation time.

### Best Practices

1. **Use `--episodes-per-iter 100`** (default) or higher for expensive environments
2. **Use local mode** (`--local`) for fast environments or development/debugging
3. **Profile your environment** to determine if distribution is beneficial
4. **Start with MuJoCo** (ant, halfcheetah) to see distributed benefits