
Algorithms

Tip

Check out the environments page to learn more about different environment types.

Available Algorithms - Overview

| Algorithm | Frameworks | Discrete Actions | Continuous Actions | Multi-Agent | Model Support | Multi-GPU |
| APPO | tf + torch | Yes +parametric | Yes | Yes | +RNN, +LSTM auto-wrapping, +Attention, +autoreg | tf + torch |
| BC | tf + torch | Yes +parametric | Yes | Yes | +RNN | torch |
| CQL | tf + torch | No | Yes | No | | tf + torch |
| DreamerV3 | tf | Yes | Yes | No | +RNN (GRU-based by default) | tf |
| DQN, Rainbow | tf + torch | Yes +parametric | No | Yes | | tf + torch |
| IMPALA | tf + torch | Yes +parametric | Yes | Yes | +RNN, +LSTM auto-wrapping, +Attention, +autoreg | tf + torch |
| MARWIL | tf + torch | Yes +parametric | Yes | Yes | +RNN | torch |
| PPO | tf + torch | Yes +parametric | Yes | Yes | +RNN, +LSTM auto-wrapping, +Attention, +autoreg | tf + torch |
| SAC | tf + torch | Yes | Yes | Yes | | torch |

Multi-Agent only Methods

Algorithm Frameworks Discrete Actions Continuous Actions Multi-Agent Model Support
Parameter Sharing Depends on bootstrapped algorithm
Fully Independent Learning Depends on bootstrapped algorithm
Shared Critic Methods Depends on bootstrapped algorithm

Offline

Behavior Cloning (BC; derived from MARWIL implementation)

pytorch tensorflow [paper] [implementation]

Our behavioral cloning implementation is directly derived from our MARWIL implementation, the only difference being that the beta parameter is force-set to 0.0. This makes BC try to match the behavior policy that generated the offline data, disregarding any resulting rewards. BC requires the offline datasets API to be used.

Tuned examples: CartPole-v1

BC-specific configs (see also common configs):

ray.rllib.algorithms.bc.bc.BCConfig
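The following is a minimal sketch of a BC setup using the config builder API. The dataset path and hyperparameter values are placeholders, not tuned settings, and exact config method names can vary slightly between Ray versions:

```python
# Minimal sketch: training BC on previously recorded offline data.
from ray.rllib.algorithms.bc import BCConfig

config = (
    BCConfig()
    # The env is only needed for observation/action spaces (and optional evaluation).
    .environment("CartPole-v1")
    # Placeholder path to data recorded via the offline datasets API.
    .offline_data(input_="/tmp/cartpole-out")
    .training(lr=1e-4, train_batch_size=256)
)

algo = config.build()
for _ in range(5):
    results = algo.train()
```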

Conservative Q-Learning (CQL)

pytorch tensorflow [paper] [implementation]

In offline RL, the algorithm has no access to an environment, but can only sample from a fixed dataset of pre-collected state-action-reward tuples. In particular, CQL (Conservative Q-Learning) is an offline RL algorithm that mitigates the overestimation of Q-values outside the dataset distribution via conservative critic estimates. It does so by adding a simple Q regularizer loss to the standard Bellman update loss. This ensures that the critic does not output overly-optimistic Q-values. This conservative correction term can be added on top of any off-policy Q-learning algorithm (here, we provide this for SAC).

RLlib's CQL is evaluated against the Behavior Cloning (BC) benchmark at 500K gradient steps over the dataset. The only difference between the BC- and CQL configs is the bc_iters parameter in CQL, indicating how many gradient steps we perform over the BC loss. CQL is evaluated on the D4RL benchmark, which has pre-collected offline datasets for many types of environments.

Tuned examples: HalfCheetah Random, Hopper Random

CQL-specific configs (see also common configs):

ray.rllib.algorithms.cql.cql.CQLConfig
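A hedged sketch of a CQL setup follows; the dataset path and values are illustrative only, with bc_iters controlling the initial BC-loss phase described above:

```python
# Sketch: CQL trained purely from a pre-collected dataset (no environment interaction).
from ray.rllib.algorithms.cql import CQLConfig

config = (
    CQLConfig()
    .environment("Pendulum-v1")
    .offline_data(input_="/tmp/pendulum-out")  # placeholder path to offline data
    .training(
        bc_iters=20000,        # number of initial gradient steps spent on the BC loss
        train_batch_size=256,
    )
)

algo = config.build()
results = algo.train()
```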

Monotonic Advantage Re-Weighted Imitation Learning (MARWIL)

pytorch tensorflow [paper] [implementation]

MARWIL is a hybrid imitation learning and policy gradient algorithm suitable for training on batched historical data. When the beta hyperparameter is set to zero, the MARWIL objective reduces to vanilla imitation learning (see BC). MARWIL requires the offline datasets API to be used.

Tuned examples: CartPole-v1

MARWIL-specific configs (see also common configs):

ray.rllib.algorithms.marwil.marwil.MARWILConfig
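A minimal sketch showing where the beta hyperparameter is set (dataset path and values are placeholders):

```python
# Sketch: MARWIL on offline data. beta > 0 re-weights actions by their advantage;
# beta = 0 reduces the objective to plain behavior cloning (see BC above).
from ray.rllib.algorithms.marwil import MARWILConfig

config = (
    MARWILConfig()
    .environment("CartPole-v1")
    .offline_data(input_="/tmp/cartpole-out")  # placeholder path
    .training(beta=1.0, lr=1e-4)
)

algo = config.build()
results = algo.train()
```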

Model-free On-policy RL

Asynchronous Proximal Policy Optimization (APPO)

pytorch tensorflow [paper] [implementation] We include an asynchronous variant of Proximal Policy Optimization (PPO) based on the IMPALA architecture. It is similar to IMPALA, but uses a surrogate policy loss with clipping. Compared to synchronous PPO, APPO is more efficient in wall-clock time due to its use of asynchronous sampling. Using a clipped loss also allows for multiple SGD passes, and therefore the potential for better sample efficiency compared to IMPALA. V-trace can also be enabled to correct for off-policy samples.

Tip

APPO is not always more efficient; it is often better to use standard PPO or IMPALA.

APPO architecture (same as IMPALA)

Tuned examples: PongNoFrameskip-v4

APPO-specific configs (see also common configs):

ray.rllib.algorithms.appo.appo.APPOConfig
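A minimal sketch of an APPO config with V-trace enabled; values are illustrative and the worker-related method names may differ slightly between Ray versions:

```python
# Sketch: APPO with V-trace correction for off-policy samples.
from ray.rllib.algorithms.appo import APPOConfig

config = (
    APPOConfig()
    .environment("CartPole-v1")
    .training(vtrace=True, lr=0.0005, train_batch_size=500)
    .rollouts(num_rollout_workers=4)   # asynchronous sample collection across workers
)

algo = config.build()
results = algo.train()
```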

Proximal Policy Optimization (PPO)

pytorch tensorflow [paper] [implementation] PPO's clipped objective supports multiple SGD passes over the same batch of experiences. RLlib's multi-GPU optimizer pins that data in GPU memory to avoid unnecessary transfers from host memory, substantially improving performance over a naive implementation. PPO scales out using multiple workers for experience collection, and also to multiple GPUs for SGD.

Tip

If you need to scale out with GPUs on multiple nodes, consider using decentralized PPO.

PPO architecture

Tuned examples: Unity3D Soccer (multi-agent: Strikers vs Goalie), Humanoid-v1, Hopper-v1, Pendulum-v1, PongDeterministic-v4, Walker2d-v1, HalfCheetah-v2, {BeamRider,Breakout,Qbert,SpaceInvaders}NoFrameskip-v4

Atari results: more details

Atari env RLlib PPO @10M RLlib PPO @25M Baselines PPO @10M
BeamRider 2807 4480 ~1800
Breakout 104 201 ~250
Qbert 11085 14247 ~14000
SpaceInvaders 671 944 ~800

Scalability: more details

MuJoCo env RLlib PPO 16-workers @ 1h Fan et al PPO 16-workers @ 1h
HalfCheetah 9664 ~7700

RLlib's multi-GPU PPO scales to multiple GPUs and hundreds of CPUs on solving the Humanoid-v1 task. Here we compare against a reference MPI-based implementation.

PPO-specific configs (see also common configs):

ray.rllib.algorithms.ppo.ppo.PPOConfig
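A hedged sketch of a PPO config that uses several SGD passes per train batch and, optionally, multiple GPUs; all values are illustrative and worker/resource method names may vary slightly between Ray versions:

```python
# Sketch: PPO with multiple SGD passes per batch and optional GPU training.
from ray.rllib.algorithms.ppo import PPOConfig

config = (
    PPOConfig()
    .environment("CartPole-v1")
    .training(
        train_batch_size=4000,
        sgd_minibatch_size=128,
        num_sgd_iter=30,      # SGD passes over each train batch (enabled by the clipped objective)
        clip_param=0.2,
    )
    .rollouts(num_rollout_workers=4)   # parallel experience collection
    .resources(num_gpus=1)             # > 1 for multi-GPU SGD, 0 for CPU-only
)

algo = config.build()
results = algo.train()
```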

Importance Weighted Actor-Learner Architecture (IMPALA)

pytorch tensorflow [paper] [implementation] In IMPALA, a central learner runs SGD in a tight loop while asynchronously pulling sample batches from many actor processes. RLlib's IMPALA implementation uses DeepMind's reference V-trace code. Note that we do not provide a deep residual network out of the box, but one can be plugged in as a custom model. Multiple learner GPUs and experience replay are also supported.

IMPALA architecture

Tuned examples: PongNoFrameskip-v4, vectorized configuration, multi-gpu configuration, {BeamRider,Breakout,Qbert,SpaceInvaders}NoFrameskip-v4

Atari results @10M steps: more details

Atari env RLlib IMPALA 32-workers Mnih et al A3C 16-workers
BeamRider 2071 ~3000
Breakout 385 ~150
Qbert 4068 ~1000
SpaceInvaders 719 ~600

Scalability:

Atari env RLlib IMPALA 32-workers @1 hour Mnih et al A3C 16-workers @1 hour
BeamRider 3181 ~1000
Breakout 538 ~10
Qbert 10850 ~500
SpaceInvaders 843 ~300

Multi-GPU IMPALA scales up to solve PongNoFrameskip-v4 in ~3 minutes using a pair of V100 GPUs and 128 CPU workers. The maximum training throughput reached is ~30k transitions per second (~120k environment frames per second).

IMPALA-specific configs (see also common configs):

ray.rllib.algorithms.impala.impala.ImpalaConfig
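A minimal sketch of an IMPALA config with a central learner and several sampling workers; values are illustrative and worker/resource method names may vary slightly between Ray versions:

```python
# Sketch: IMPALA with many actor processes feeding a (GPU-capable) central learner.
from ray.rllib.algorithms.impala import ImpalaConfig

config = (
    ImpalaConfig()
    .environment("CartPole-v1")
    .training(lr=0.0005, train_batch_size=500)
    .rollouts(num_rollout_workers=8)   # actor processes producing sample batches asynchronously
    .resources(num_gpus=1)             # increase for multiple learner GPUs
)

algo = config.build()
results = algo.train()
```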

Model-free Off-policy RL

Deep Q Networks (DQN, Rainbow, Parametric DQN)

pytorch tensorflow [paper] [implementation] DQN can be scaled by increasing the number of workers or using Ape-X. Memory usage is reduced by compressing samples in the replay buffer with LZ4. All of the DQN improvements evaluated in Rainbow are available, though not all are enabled by default. See also how to use parametric-actions in DQN.

DQN architecture

Tuned examples: PongDeterministic-v4, Rainbow configuration, {BeamRider,Breakout,Qbert,SpaceInvaders}NoFrameskip-v4, with Dueling and Double-Q, with Distributional DQN.

Tip

Consider using Ape-X for faster training with similar timestep efficiency.

Hint

For a complete Rainbow setup, make the following changes to the default DQN config: "n_step": [between 1 and 10], "noisy": True, "num_atoms": [more than 1], "v_min": -10.0, "v_max": 10.0 (set v_min and v_max according to your expected range of returns).
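Applied to a DQNConfig, the hint above might look as follows (all values are illustrative; v_min/v_max should match your environment's expected return range):

```python
# Sketch: Rainbow-style settings from the hint, set on a DQNConfig.
from ray.rllib.algorithms.dqn import DQNConfig

config = (
    DQNConfig()
    .environment("CartPole-v1")
    .training(
        n_step=3,          # multi-step returns, anywhere between 1 and 10
        noisy=True,        # NoisyNet exploration instead of epsilon-greedy
        num_atoms=51,      # > 1 switches on distributional DQN
        v_min=-10.0,
        v_max=10.0,
        double_q=True,     # keep double-Q enabled
        dueling=True,      # keep the dueling architecture enabled
    )
)

algo = config.build()
results = algo.train()
```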

Atari results @10M steps: more details

Atari env RLlib DQN RLlib Dueling DDQN RLlib Dist. DQN Hessel et al. DQN
BeamRider 2869 1910 4447 ~2000
Breakout 287 312 410 ~150
Qbert 3921 7968 15780 ~4000
SpaceInvaders 650 1001 1025 ~500

DQN-specific configs (see also common configs):

ray.rllib.algorithms.dqn.dqn.DQNConfig

Soft Actor Critic (SAC)

pytorch tensorflow [original paper], [follow up paper], [discrete actions paper] [implementation]

SAC architecture (same as DQN)

RLlib's soft actor-critic implementation is ported from the official SAC repo to better integrate with RLlib APIs. Note that SAC has two fields for configuring custom models: policy_model_config and q_model_config; the model field of the config is ignored.

Tuned examples (continuous actions): Pendulum-v1, HalfCheetah-v3. Tuned examples (discrete actions): CartPole-v1

MuJoCo results @3M steps: more details

MuJoCo env RLlib SAC Haarnoja et al SAC
HalfCheetah 13000 ~15000

SAC-specific configs (see also common configs):

ray.rllib.algorithms.sac.sac.SACConfig
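A minimal sketch using the two dedicated model-config fields mentioned above; the network sizes and other values are illustrative only:

```python
# Sketch: SAC with custom policy- and Q-network sizes.
# Note: the generic `model` field is ignored by SAC; use the two fields below instead.
from ray.rllib.algorithms.sac import SACConfig

config = (
    SACConfig()
    .environment("Pendulum-v1")
    .training(
        policy_model_config={"fcnet_hiddens": [256, 256]},
        q_model_config={"fcnet_hiddens": [256, 256]},
        train_batch_size=256,
    )
)

algo = config.build()
results = algo.train()
```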

Model-based RL

DreamerV3

tensorflow [paper] [implementation]

DreamerV3 trains a world model in supervised fashion using real environment interactions. The world model's objective is to correctly predict all aspects of the transition dynamics of the RL environment: the correct next observations, the received rewards, and a boolean episode-continuation flag. A recurrent state space model (RSSM) is used, and training alternates between updating the world model (from actual env data) and updating the critic and actor networks, both of which are trained on "dreamed" trajectories produced by the world model.

DreamerV3 can be used in all types of environments, including those with image- or vector-based observations, continuous or discrete actions, and sparse or dense reward functions.
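A hedged sketch of a DreamerV3 setup (tf only). The model_size and training_ratio options shown are assumptions for illustration; check DreamerV3Config in your Ray version for the exact settings available:

```python
# Sketch: DreamerV3 on a simple vector-observation task.
from ray.rllib.algorithms.dreamerv3 import DreamerV3Config

config = (
    DreamerV3Config()
    .environment("CartPole-v1")
    .training(
        model_size="XS",        # assumed size preset for world model, actor, and critic
        training_ratio=1024,    # assumed ratio of trained timesteps per sampled env timestep
    )
)

algo = config.build()
results = algo.train()
```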

Tuned examples: Atari 100k, Atari 200M, DeepMind Control Suite

Pong-v5 results (1, 2, and 4 GPUs):

Episode mean rewards for the Pong-v5 environment (with the "100k" setting, in which only 100k environment steps are allowed). Note that despite the stable sample efficiency (shown by the constant learning performance per env step), the wall time improves almost linearly as we go from 1 to 4 GPUs. Left: Episode reward over environment timesteps sampled. Right: Episode reward over wall-time.

Atari 100k results (1 vs 4 GPUs):

Episode mean rewards for various Atari 100k tasks on 1 vs 4 GPUs. Left: Episode reward over environment timesteps sampled. Right: Episode reward over wall-time.

DeepMind Control Suite (vision) results (1 vs 4 GPUs):

Episode mean rewards for various DeepMind Control Suite (vision) tasks on 1 vs 4 GPUs. Left: Episode reward over environment timesteps sampled. Right: Episode reward over wall-time.

Multi-agent

Parameter Sharing

[paper], [paper] and [instructions]. Parameter sharing refers to a class of methods that take a base single-agent method and use it to learn a single policy for all agents. This simple approach has been shown to achieve state-of-the-art performance in cooperative games and is usually how you should start when tackling a multi-agent problem.
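A minimal sketch of parameter sharing with PPO as the bootstrapped algorithm. MultiAgentCartPole is RLlib's example multi-agent env, and its import path as well as the number of agents used here are assumptions that may differ between Ray versions:

```python
# Sketch: one shared PPO policy controls every agent in a multi-agent env.
from ray.rllib.algorithms.ppo import PPOConfig
from ray.rllib.examples.env.multi_agent import MultiAgentCartPole  # path may vary by Ray version

config = (
    PPOConfig()
    .environment(MultiAgentCartPole, env_config={"num_agents": 4})
    .multi_agent(
        policies={"shared_policy"},
        # Every agent maps to the same policy ID -> parameters are shared.
        policy_mapping_fn=lambda agent_id, episode, worker=None, **kwargs: "shared_policy",
    )
)

algo = config.build()
results = algo.train()
```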

Tuned examples: PettingZoo, waterworld, rock-paper-scissors, multi-agent cartpole

Shared Critic Methods

[instructions] Shared critic methods are those in which all agents use a single, parameter-shared critic network (in some cases with access to more of the observation space than individual agents can see). Note that many specialized multi-agent algorithms, such as MADDPG, are mostly shared-critic forms of their single-agent counterparts (DDPG in the case of MADDPG).

Tuned examples: TwoStepGame

Fully Independent Learning

[instructions] Fully independent learning involves a collection of agents learning independently of each other via single-agent methods. This typically works, but can be less effective than dedicated multi-agent RL methods, since independent learners do not account for the non-stationarity of the multi-agent environment.
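A minimal sketch of fully independent learning, again with PPO as the bootstrapped algorithm. As above, MultiAgentCartPole, its import path, and the agent-to-policy naming scheme are assumptions for illustration:

```python
# Sketch: fully independent learning -- each agent gets its own, separately trained PPO policy.
from ray.rllib.algorithms.ppo import PPOConfig
from ray.rllib.examples.env.multi_agent import MultiAgentCartPole  # path may vary by Ray version

NUM_AGENTS = 2

config = (
    PPOConfig()
    .environment(MultiAgentCartPole, env_config={"num_agents": NUM_AGENTS})
    .multi_agent(
        policies={f"policy_{i}" for i in range(NUM_AGENTS)},
        # Map each agent ID (0, 1, ...) to its own policy -> no parameter sharing.
        policy_mapping_fn=lambda agent_id, episode, worker=None, **kwargs: f"policy_{agent_id}",
    )
)

algo = config.build()
results = algo.train()
```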

Tuned examples: waterworld, multiagent-cartpole