Tip
Check out the environments page to learn more about different environment types.
Algorithm | Frameworks | Discrete Actions | Continuous Actions | Multi-Agent | Model Support | Multi-GPU |
---|---|---|---|---|---|---|
APPO | tf + torch | Yes +parametric | Yes | Yes | +RNN, +LSTM auto-wrapping, +Attention, +autoreg | tf + torch |
BC | tf + torch | Yes +parametric | Yes | Yes | | torch |
CQL | tf + torch | No | Yes | No | | tf + torch |
DreamerV3 | tf | Yes | Yes | No | +RNN (GRU-based by default) | tf |
DQN, Rainbow | tf + torch | Yes +parametric | No | Yes | | tf + torch |
IMPALA | tf + torch | Yes +parametric | Yes | Yes | +RNN, +LSTM auto-wrapping, +Attention, +autoreg | tf + torch |
MARWIL | tf + torch | Yes +parametric | Yes | Yes | +RNN | torch |
PPO | tf + torch | Yes +parametric | Yes | Yes | | tf + torch |
SAC | tf + torch | Yes | Yes | Yes | | torch |
Multi-Agent Only Methods
Algorithm | Frameworks | Discrete Actions | Continuous Actions | Multi-Agent | Model Support |
---|---|---|---|---|---|
Parameter Sharing | Depends on bootstrapped algorithm |||||
Fully Independent Learning | Depends on bootstrapped algorithm |||||
Shared Critic Methods | Depends on bootstrapped algorithm |||||
Our behavioral cloning (BC) implementation is directly derived from our MARWIL implementation, the only difference being that the beta
parameter is force-set to 0.0. This makes BC try to match the behavior policy that generated the offline data, disregarding any resulting rewards. BC requires the offline datasets API to be used.
Tuned examples: CartPole-v1
BC-specific configs (see also common configs):
ray.rllib.algorithms.bc.bc.BCConfig
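As a sketch (not one of the tuned examples), a minimal BC setup through the config API might look like the following; the dataset path is a placeholder, and exact method names can vary across RLlib versions:

```python
from ray.rllib.algorithms.bc import BCConfig

# BC has no env interaction: it reads pre-collected experiences
# through the offline datasets API. The input path is a placeholder.
config = (
    BCConfig()
    .environment("CartPole-v1")
    .offline_data(input_="/tmp/cartpole-out")  # placeholder path
    .training(lr=0.0008)
)
# algo = config.build()
# algo.train()
```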
In offline RL, the algorithm has no access to an environment, but can only sample from a fixed dataset of pre-collected state-action-reward tuples. In particular, CQL (Conservative Q-Learning) is an offline RL algorithm that mitigates the overestimation of Q-values outside the dataset distribution via conservative critic estimates. It does so by adding a simple Q regularizer loss to the standard Bellman update loss. This ensures that the critic does not output overly-optimistic Q-values. This conservative correction term can be added on top of any off-policy Q-learning algorithm (here, we provide this for SAC).
RLlib's CQL is evaluated against the Behavior Cloning (BC) benchmark at 500K gradient steps over the dataset. The only difference between the BC and CQL configs is the bc_iters
parameter in CQL, indicating how many gradient steps we perform over the BC loss. CQL is evaluated on the D4RL benchmark, which provides pre-collected offline datasets for many types of environments.
Tuned examples: HalfCheetah Random, Hopper Random
CQL-specific configs (see also common configs):
ray.rllib.algorithms.cql.cql.CQLConfig
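The conservative regularizer described above, which pushes down a soft maximum over all Q-values while pushing up the Q-value of the dataset action, can be sketched for discrete actions as follows (a NumPy illustration, not RLlib's actual implementation):

```python
import numpy as np

def cql_regularizer(q_values, dataset_actions):
    """Sketch of the CQL regularizer for discrete actions.

    q_values: (batch, num_actions) critic outputs Q(s, .).
    dataset_actions: (batch,) integer actions from the offline data.

    Returns mean of [logsumexp_a Q(s, a) - Q(s, a_data)], which
    penalizes overly-optimistic Q-values on out-of-distribution
    actions relative to the actions actually in the dataset.
    """
    # Numerically stable log-sum-exp over the action dimension.
    max_q = q_values.max(axis=1, keepdims=True)
    logsumexp = max_q[:, 0] + np.log(np.exp(q_values - max_q).sum(axis=1))
    data_q = q_values[np.arange(len(dataset_actions)), dataset_actions]
    return (logsumexp - data_q).mean()
```

Since logsumexp over actions is always at least the dataset action's Q-value, this term is non-negative and vanishes only as the policy's Q-mass concentrates on in-distribution actions.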
MARWIL is a hybrid imitation learning and policy gradient algorithm suitable for training on batched historical data. When the beta
hyperparameter is set to zero, the MARWIL objective reduces to vanilla imitation learning (see BC). MARWIL requires the offline datasets API to be used.
Tuned examples: CartPole-v1
MARWIL-specific configs (see also common configs):
ray.rllib.algorithms.marwil.marwil.MARWILConfig
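The role of beta can be illustrated with a small NumPy sketch (not RLlib's implementation): each imitation term is scaled by exp(beta * advantage), so beta = 0 recovers plain behavior cloning.

```python
import numpy as np

def marwil_loss(log_probs, advantages, beta):
    """Sketch of MARWIL's advantage-weighted behavior-cloning loss.

    log_probs: log pi(a_t | s_t) for the actions in the offline batch.
    advantages: advantage estimates for those actions.
    With beta=0 every weight is 1 and the loss reduces to vanilla BC.
    """
    weights = np.exp(beta * advantages)
    return -(weights * log_probs).mean()
```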
[paper] [implementation] We include an asynchronous variant of Proximal Policy Optimization (PPO) based on the IMPALA architecture. This is similar to IMPALA but using a surrogate policy loss with clipping. Compared to synchronous PPO, APPO is more efficient in wall-clock time due to its use of asynchronous sampling. Using a clipped loss also allows for multiple SGD passes, and therefore the potential for better sample efficiency compared to IMPALA. V-trace can also be enabled to correct for off-policy samples.
Tip
APPO is not always more efficient; it is often better to use standard PPO or IMPALA.
Tuned examples: PongNoFrameskip-v4
APPO-specific configs (see also common configs):
ray.rllib.algorithms.appo.appo.APPOConfig
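A minimal APPO setup might look like this (a sketch; method and parameter names can differ across RLlib versions):

```python
from ray.rllib.algorithms.appo import APPOConfig

config = (
    APPOConfig()
    .environment("PongNoFrameskip-v4")
    # Enable V-trace to correct for the off-policy-ness of
    # asynchronously collected samples.
    .training(vtrace=True)
    .rollouts(num_rollout_workers=4)  # parallel sample collection
)
# algo = config.build()
```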
[paper] [implementation] PPO's clipped objective supports multiple SGD passes over the same batch of experiences. RLlib's multi-GPU optimizer pins that data in GPU memory to avoid unnecessary transfers from host memory, substantially improving performance over a naive implementation. PPO scales out using multiple workers for experience collection, and also to multiple GPUs for SGD.
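The clipped surrogate objective can be sketched as follows (a NumPy illustration of the loss term, not RLlib's implementation):

```python
import numpy as np

def ppo_clip_loss(ratios, advantages, clip_eps=0.2):
    """PPO's clipped surrogate loss (to be minimized).

    ratios: pi_new(a|s) / pi_old(a|s) for each sampled transition.
    advantages: advantage estimates for those transitions.
    Clipping removes the incentive to push the ratio outside
    [1 - clip_eps, 1 + clip_eps], which is what makes multiple
    SGD passes over the same batch safe.
    """
    unclipped = ratios * advantages
    clipped = np.clip(ratios, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -np.minimum(unclipped, clipped).mean()
```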
Tip
If you need to scale out with GPUs on multiple nodes, consider using decentralized PPO.
Tuned examples: Unity3D Soccer (multi-agent: Strikers vs Goalie), Humanoid-v1, Hopper-v1, Pendulum-v1, PongDeterministic-v4, Walker2d-v1, HalfCheetah-v2, {BeamRider,Breakout,Qbert,SpaceInvaders}NoFrameskip-v4
Atari results: more details
Atari env | RLlib PPO @10M | RLlib PPO @25M | Baselines PPO @10M |
---|---|---|---|
BeamRider | 2807 | 4480 | ~1800 |
Breakout | 104 | 201 | ~250 |
Qbert | 11085 | 14247 | ~14000 |
SpaceInvaders | 671 | 944 | ~800 |
Scalability: more details
MuJoCo env | RLlib PPO 16-workers @ 1h | Fan et al PPO 16-workers @ 1h |
---|---|---|
HalfCheetah | 9664 | ~7700 |
PPO-specific configs (see also common configs):
ray.rllib.algorithms.ppo.ppo.PPOConfig
[paper] [implementation] In IMPALA, a central learner runs SGD in a tight loop while asynchronously pulling sample batches from many actor processes. RLlib's IMPALA implementation uses DeepMind's reference V-trace code. Note that we do not provide a deep residual network out of the box, but one can be plugged in as a custom model. Multiple learner GPUs and experience replay are also supported.
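The V-trace targets the learner computes can be sketched as follows (a simplified single-trajectory NumPy version; rho_bar and c_bar are the clipping constants from the IMPALA paper, and is_ratios are the pi/mu importance ratios):

```python
import numpy as np

def vtrace_targets(rewards, values, bootstrap_value, is_ratios,
                   gamma=0.99, rho_bar=1.0, c_bar=1.0):
    """Sketch of V-trace value targets for one trajectory of length T.

    rewards, values, is_ratios: arrays of length T.
    bootstrap_value: V(s_T), the value estimate after the last step.
    """
    T = len(rewards)
    rhos = np.minimum(rho_bar, is_ratios)  # clipped IS weights for deltas
    cs = np.minimum(c_bar, is_ratios)      # clipped "trace cutting" weights
    values_tp1 = np.append(values[1:], bootstrap_value)
    deltas = rhos * (rewards + gamma * values_tp1 - values)
    # Backward recursion: v_t = V(s_t) + delta_t + gamma*c_t*(v_{t+1} - V(s_{t+1}))
    vs = np.zeros(T)
    acc = 0.0
    for t in reversed(range(T)):
        acc = deltas[t] + gamma * cs[t] * acc
        vs[t] = values[t] + acc
    return vs
```

When the data is on-policy (all ratios equal 1), the targets reduce to ordinary n-step bootstrapped returns.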
IMPALA architecture
Tuned examples: PongNoFrameskip-v4, vectorized configuration, multi-gpu configuration, {BeamRider,Breakout,Qbert,SpaceInvaders}NoFrameskip-v4
Atari results @10M steps: more details
Atari env | RLlib IMPALA 32-workers | Mnih et al A3C 16-workers |
---|---|---|
BeamRider | 2071 | ~3000 |
Breakout | 385 | ~150 |
Qbert | 4068 | ~1000 |
SpaceInvaders | 719 | ~600 |
Scalability:
Atari env | RLlib IMPALA 32-workers @1 hour | Mnih et al A3C 16-workers @1 hour |
---|---|---|
BeamRider | 3181 | ~1000 |
Breakout | 538 | ~10 |
Qbert | 10850 | ~500 |
SpaceInvaders | 843 | ~300 |
IMPALA-specific configs (see also common configs):
ray.rllib.algorithms.impala.impala.ImpalaConfig
[paper] [implementation] DQN can be scaled by increasing the number of workers or using Ape-X. Memory usage is reduced by compressing samples in the replay buffer with LZ4. All of the DQN improvements evaluated in Rainbow are available, though not all are enabled by default. See also how to use parametric-actions in DQN.
DQN architecture
Tuned examples: PongDeterministic-v4, Rainbow configuration, {BeamRider,Breakout,Qbert,SpaceInvaders}NoFrameskip-v4, with Dueling and Double-Q, with Distributional DQN.
Tip
Consider using Ape-X for faster training with similar timestep efficiency.
Hint
For a complete Rainbow setup, make the following changes to the default DQN config: "n_step": [between 1 and 10], "noisy": True, "num_atoms": [more than 1], "v_min": -10.0, "v_max": 10.0 (set v_min and v_max according to your expected range of returns).
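In config-API form, the hint translates roughly into the following (a sketch; the concrete values such as n_step=3 and num_atoms=51 are illustrative choices, not tuned settings):

```python
from ray.rllib.algorithms.dqn import DQNConfig

config = (
    DQNConfig()
    .environment("CartPole-v1")
    .training(
        n_step=3,        # between 1 and 10
        noisy=True,      # NoisyNet exploration
        num_atoms=51,    # more than 1 enables distributional DQN
        v_min=-10.0,     # lower bound of expected returns
        v_max=10.0,      # upper bound of expected returns
    )
)
```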
Atari results @10M steps: more details
Atari env | RLlib DQN | RLlib Dueling DDQN | RLlib Dist. DQN | Hessel et al. DQN |
---|---|---|---|---|
BeamRider | 2869 | 1910 | 4447 | ~2000 |
Breakout | 287 | 312 | 410 | ~150 |
Qbert | 3921 | 7968 | 15780 | ~4000 |
SpaceInvaders | 650 | 1001 | 1025 | ~500 |
DQN-specific configs (see also common configs):
ray.rllib.algorithms.dqn.dqn.DQNConfig
[original paper], [follow up paper], [discrete actions paper] [implementation]
SAC architecture (same as DQN)
RLlib's soft actor-critic (SAC) implementation is ported from the official SAC repo to better integrate with RLlib APIs. Note that SAC has two fields for configuring custom models, policy_model_config and q_model_config; the model field of the config is ignored.
Tuned examples (continuous actions): Pendulum-v1, HalfCheetah-v3. Tuned examples (discrete actions): CartPole-v1
MuJoCo results @3M steps: more details
MuJoCo env | RLlib SAC | Haarnoja et al SAC |
---|---|---|
HalfCheetah | 13000 | ~15000 |
SAC-specific configs (see also common configs):
ray.rllib.algorithms.sac.sac.SACConfig
DreamerV3 trains a world model in a supervised fashion using real environment interactions. The world model's objective is to correctly predict all aspects of the RL environment's transition dynamics, which includes predicting the correct next observations, the received rewards, and a boolean episode-continuation flag. A "recurrent state space model" (RSSM) is used, and training alternates between updating the world model (on actual env data) and updating the critic and actor networks, both of which are trained on "dreamed" trajectories produced by the world model.
DreamerV3 can be used in all types of environments, including those with image- or vector-based observations, continuous or discrete actions, and sparse or dense reward functions.
Tuned examples: Atari 100k, Atari 200M, DeepMind Control Suite
Pong-v5 results (1, 2, and 4 GPUs):
Episode mean rewards for the Pong-v5 environment (with the "100k" setting, in which only 100k environment steps are allowed). Note that despite the stable sample efficiency, shown by the constant learning performance per env step, the wall time improves almost linearly as we go from 1 to 4 GPUs. Left: episode reward over environment timesteps sampled. Right: episode reward over wall-time.
Atari 100k results (1 vs 4 GPUs):
Episode mean rewards for various Atari 100k tasks on 1 vs 4 GPUs. Left: episode reward over environment timesteps sampled. Right: episode reward over wall-time.
DeepMind Control Suite (vision) results (1 vs 4 GPUs):
Episode mean rewards for various DeepMind Control Suite (vision) tasks on 1 vs 4 GPUs. Left: episode reward over environment timesteps sampled. Right: episode reward over wall-time.
[paper], [paper] and [instructions]. Parameter sharing refers to a class of methods that take a base single-agent method and use it to learn a single policy for all agents. This simple approach has been shown to achieve state-of-the-art performance in cooperative games and is usually how you should start when tackling a multi-agent problem.
Tuned examples: PettingZoo, waterworld, rock-paper-scissors, multi-agent cartpole
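In RLlib, parameter sharing can be expressed by mapping every agent to one shared policy (a sketch; "my_multi_agent_env" is a placeholder for a registered multi-agent environment, and PPO stands in for any bootstrapped algorithm):

```python
from ray.rllib.algorithms.ppo import PPOConfig

config = (
    PPOConfig()
    .environment("my_multi_agent_env")  # placeholder env name
    .multi_agent(
        # A single policy ID: every agent's experience trains
        # the same set of weights.
        policies={"shared"},
        policy_mapping_fn=lambda agent_id, *args, **kwargs: "shared",
    )
)
```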
[instructions] Shared critic methods are those in which all agents use a single, parameter-shared critic network (in some cases with access to more of the observation space than any single agent can see). Note that many specialized multi-agent algorithms, such as MADDPG, are essentially shared-critic forms of a single-agent algorithm (DDPG in the case of MADDPG).
Tuned examples: TwoStepGame
[instructions] Fully independent learning involves a collection of agents learning independently of each other via single-agent methods. This typically works, but can be less effective than dedicated multi-agent RL methods, since independent learners do not account for the non-stationarity of the multi-agent environment.
Tuned examples: waterworld, multiagent-cartpole