Add RL example. #424
Conversation
Code Review
This pull request introduces a distributed Reinforcement Learning example using the REINFORCE algorithm and the Flame Runner API, supporting various Gymnasium and MuJoCo environments. It also includes infrastructure updates that optimize Python package installation via uv and improve the dynamic discovery of shared libraries for the environment's LD_LIBRARY_PATH. Feedback focuses on improving the modularity of the RL example by extracting shared components into a separate module, which avoids fragile imports during remote execution, and on optimizing the performance of the discounted-reward calculation.
```python
import numpy as np
import torch

from main import ENV_CONFIGS, create_policy
```
The `collect_episode` function, which is designed to run on remote executors, imports components directly from the main script. This creates a tight coupling and can be fragile, especially with code serialization and remote execution. It's generally better to avoid such imports from the main executable script within a distributed task.

To improve modularity and robustness, I recommend extracting shared components into a separate file (e.g., `model.py`). This file could contain:

- the `EnvConfig` dataclass
- the `ENV_CONFIGS` dictionary
- the `DiscretePolicy` and `ContinuousPolicy` classes
- the `create_policy` function

Both `main.py` and `collect_episode` can then import from this new module, as sketched below. This change will make the code cleaner and less prone to issues in a distributed environment.
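A minimal sketch of what such a `model.py` could look like. The module, class, and function names follow the review suggestion; the config fields, environment entries, and network internals are illustrative assumptions, not the PR's actual code:

```python
# model.py -- shared components importable by both main.py and remote tasks.
# Sketch only: field names, env entries, and network sizes are assumptions.
from dataclasses import dataclass

import torch
import torch.nn as nn


@dataclass
class EnvConfig:
    env_name: str
    obs_dim: int
    action_dim: int
    continuous: bool = False


# Illustrative entries; the real dictionary would cover the PR's environments.
ENV_CONFIGS = {
    "CartPole-v1": EnvConfig("CartPole-v1", obs_dim=4, action_dim=2),
    "HalfCheetah-v4": EnvConfig("HalfCheetah-v4", obs_dim=17, action_dim=6,
                                continuous=True),
}


class DiscretePolicy(nn.Module):
    def __init__(self, obs_dim: int, action_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(), nn.Linear(hidden, action_dim)
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        # Returns action logits; callers sample via Categorical(logits=...).
        return self.net(obs)


class ContinuousPolicy(nn.Module):
    def __init__(self, obs_dim: int, action_dim: int, hidden: int = 64):
        super().__init__()
        self.mu = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(), nn.Linear(hidden, action_dim)
        )
        self.log_std = nn.Parameter(torch.zeros(action_dim))

    def forward(self, obs: torch.Tensor):
        # Returns mean and std of a Gaussian action distribution.
        return self.mu(obs), self.log_std.exp()


def create_policy(config: EnvConfig) -> nn.Module:
    cls = ContinuousPolicy if config.continuous else DiscretePolicy
    return cls(config.obs_dim, config.action_dim)
```

With this layout, `collect_episode` only needs `from model import ENV_CONFIGS, create_policy`, so the serialized task never references the `__main__` module.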
```python
discounted = []
R = 0
for r in reversed(rewards):
    R = r + gamma * R
    discounted.insert(0, R)
```
The current implementation for calculating discounted rewards uses `list.insert(0, ...)` inside a loop. This has a time complexity of O(n²), where n is the episode length, which can be inefficient for long episodes.

You can improve the performance to O(n) by pre-allocating the list and filling it in reverse.
Suggested change:

```diff
-discounted = []
-R = 0
-for r in reversed(rewards):
-    R = r + gamma * R
-    discounted.insert(0, R)
+discounted = [0.0] * len(rewards)
+R = 0
+for i in range(len(rewards) - 1, -1, -1):
+    R = rewards[i] + gamma * R
+    discounted[i] = R
```
…mize discounted rewards

- Extract EnvConfig, ENV_CONFIGS, DiscretePolicy, ContinuousPolicy, create_policy into model.py
- Update collect_episode to import from model instead of main (fixes fragile distributed imports)
- Optimize compute_discounted_rewards from O(n²) to O(n) using a pre-allocated list
- Update README.md to document the new file structure

Addresses PR review feedback from gemini-code-assist
Fixes build error: 'Multiple top-level modules discovered in a flat-layout' after adding model.py alongside main.py
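The commit message does not show the fix itself. For context, setuptools raises this error when auto-discovery finds more than one top-level module; one common resolution (which may or may not be the one used here) is to list the modules explicitly in pyproject.toml:

```toml
# Hypothetical snippet: declares both top-level modules so setuptools
# auto-discovery no longer has to guess the project layout.
[tool.setuptools]
py-modules = ["main", "model"]
```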
…ilures

- Add '|| true' to mkdir commands to prevent failure with 'set -e'
- Add '-r' flag to xargs so it does not run if input is empty
- Add an empty-line check in the while loop
- Redirect xargs stderr to /dev/null
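Taken together, those four hardening patterns look roughly like this; the search root, glob, and cache path below are placeholders, not the PR's actual script:

```bash
set -e

# '|| true' keeps 'set -e' from aborting when mkdir fails (for example,
# because the directory already exists).
mkdir /tmp/lib-cache || true

# Collect directories containing shared libraries.
find /opt -name 'lib*.so*' 2>/dev/null | while IFS= read -r lib; do
    # Empty-line check: skip blank lines in the stream.
    [ -z "$lib" ] && continue
    dirname "$lib"
done | sort -u > /tmp/lib-cache/dirs.txt

# 'xargs -r' runs nothing on empty input (instead of running the command
# once with zero arguments); redirecting stderr keeps noise out of logs.
xargs -r -n1 echo < /tmp/lib-cache/dirs.txt 2>/dev/null
```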
Codecov Report

✅ All modified and coverable lines are covered by tests.