
Algorithms

Tip

Check out the environments page to learn more about different environment types.

Available Algorithms - Overview

| Algorithm | Frameworks | Discrete Actions | Continuous Actions | Multi-Agent | Model Support | Multi-GPU |
| APPO | tf + torch | Yes +parametric | Yes | Yes | +RNN, +LSTM auto-wrapping, +Attention, +autoreg | tf + torch |
| BC | tf + torch | Yes +parametric | Yes | Yes | +RNN | torch |
| CQL | tf + torch | No | Yes | No | | tf + torch |
| DreamerV3 | tf | Yes | Yes | No | +RNN (GRU-based by default) | tf |
| DQN, Rainbow | tf + torch | Yes +parametric | No | Yes | | tf + torch |
| IMPALA | tf + torch | Yes +parametric | Yes | Yes | +RNN, +LSTM auto-wrapping, +Attention, +autoreg | tf + torch |
| MARWIL | tf + torch | Yes +parametric | Yes | Yes | +RNN | torch |
| PPO | tf + torch | Yes +parametric | Yes | Yes | +RNN, +LSTM auto-wrapping, +Attention, +autoreg | tf + torch |
| SAC | tf + torch | Yes | Yes | Yes | | torch |

Multi-Agent only Methods

Algorithm Frameworks Discrete Actions Continuous Actions Multi-Agent Model Support
Parameter Sharing Depends on bootstrapped algorithm
Fully Independent Learning Depends on bootstrapped algorithm
Shared Critic Methods Depends on bootstrapped algorithm

Offline

Behavior Cloning (BC; derived from MARWIL implementation)

pytorch tensorflow [paper] [implementation]

Our behavioral cloning implementation is directly derived from our MARWIL implementation, the only difference being that the beta parameter is force-set to 0.0. This makes BC try to match the behavior policy that generated the offline data, disregarding any resulting rewards. BC requires the offline datasets API to be used.

Tuned examples: CartPole-v1

BC-specific configs (see also common configs):

ray.rllib.algorithms.bc.bc.BCConfig
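The following is a minimal sketch of a BC setup using the config builder API. The dataset path and hyperparameter values are placeholders, not tuned settings, and exact config method names can vary slightly between Ray versions:

```python
# Minimal sketch: training BC on previously recorded offline data.
from ray.rllib.algorithms.bc import BCConfig

config = (
    BCConfig()
    # The env is only needed for observation/action spaces (and optional evaluation).
    .environment("CartPole-v1")
    # Placeholder path to data recorded via the offline datasets API.
    .offline_data(input_="/tmp/cartpole-out")
    .training(lr=1e-4, train_batch_size=256)
)

algo = config.build()
for _ in range(5):
    results = algo.train()
```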

Conservative Q-Learning (CQL)

pytorch tensorflow [paper] [implementation]

In offline RL, the algorithm has no access to an environment, but can only sample from a fixed dataset of pre-collected state-action-reward tuples. In particular, CQL (Conservative Q-Learning) is an offline RL algorithm that mitigates the overestimation of Q-values outside the dataset distribution via conservative critic estimates. It does so by adding a simple Q regularizer loss to the standard Bellman update loss. This ensures that the critic does not output overly-optimistic Q-values. This conservative correction term can be added on top of any off-policy Q-learning algorithm (here, we provide this for SAC).

RLlib's CQL is evaluated against the Behavior Cloning (BC) benchmark at 500K gradient steps over the dataset. The only difference between the BC- and CQL configs is the bc_iters parameter in CQL, indicating how many gradient steps we perform over the BC loss. CQL is evaluated on the D4RL benchmark, which has pre-collected offline datasets for many types of environments.

Tuned examples: HalfCheetah Random, Hopper Random

CQL-specific configs (see also common configs):

ray.rllib.algorithms.cql.cql.CQLConfig
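A hedged sketch of a CQL setup follows; the dataset path and values are illustrative only, with bc_iters controlling the initial BC-loss phase described above:

```python
# Sketch: CQL trained purely from a pre-collected dataset (no environment interaction).
from ray.rllib.algorithms.cql import CQLConfig

config = (
    CQLConfig()
    .environment("Pendulum-v1")
    .offline_data(input_="/tmp/pendulum-out")  # placeholder path to offline data
    .training(
        bc_iters=20000,        # number of initial gradient steps spent on the BC loss
        train_batch_size=256,
    )
)

algo = config.build()
results = algo.train()
```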

Monotonic Advantage Re-Weighted Imitation Learning (MARWIL)

pytorch tensorflow [paper] [implementation]

MARWIL is a hybrid imitation learning and policy gradient algorithm suitable for training on batched historical data. When the beta hyperparameter is set to zero, the MARWIL objective reduces to vanilla imitation learning (see BC). MARWIL requires the offline datasets API to be used.

Tuned examples: CartPole-v1

MARWIL-specific configs (see also common configs):

ray.rllib.algorithms.marwil.marwil.MARWILConfig
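A minimal sketch showing where the beta hyperparameter is set (dataset path and values are placeholders):

```python
# Sketch: MARWIL on offline data. beta > 0 re-weights actions by their advantage;
# beta = 0 reduces the objective to plain behavior cloning (see BC above).
from ray.rllib.algorithms.marwil import MARWILConfig

config = (
    MARWILConfig()
    .environment("CartPole-v1")
    .offline_data(input_="/tmp/cartpole-out")  # placeholder path
    .training(beta=1.0, lr=1e-4)
)

algo = config.build()
results = algo.train()
```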

Model-free On-policy RL

Asynchronous Proximal Policy Optimization (APPO)

pytorch tensorflow [paper] [implementation] We include an asynchronous variant of Proximal Policy Optimization (PPO) based on the IMPALA architecture. It is similar to IMPALA, but uses a surrogate policy loss with clipping. Compared to synchronous PPO, APPO is more efficient in wall-clock time due to its use of asynchronous sampling. Using a clipped loss also allows for multiple SGD passes, and therefore the potential for better sample efficiency compared to IMPALA. V-trace can also be enabled to correct for off-policy samples.

Tip

APPO is not always more efficient; it is often better to use standard PPO or IMPALA.

APPO architecture (same as IMPALA)

Tuned examples: PongNoFrameskip-v4

APPO-specific configs (see also common configs):

ray.rllib.algorithms.appo.appo.APPOConfig
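A minimal sketch of an APPO config with V-trace enabled; values are illustrative and the worker-related method names may differ slightly between Ray versions:

```python
# Sketch: APPO with V-trace correction for off-policy samples.
from ray.rllib.algorithms.appo import APPOConfig

config = (
    APPOConfig()
    .environment("CartPole-v1")
    .training(vtrace=True, lr=0.0005, train_batch_size=500)
    .rollouts(num_rollout_workers=4)   # asynchronous sample collection across workers
)

algo = config.build()
results = algo.train()
```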

Proximal Policy Optimization (PPO)

pytorch tensorflow [paper] [implementation] PPO's clipped objective supports multiple SGD passes over the same batch of experiences. RLlib's multi-GPU optimizer pins that data in GPU memory to avoid unnecessary transfers from host memory, substantially improving performance over a naive implementation. PPO scales out using multiple workers for experience collection, and also to multiple GPUs for SGD.

Tip

If you need to scale out with GPUs on multiple nodes, consider using decentralized PPO.

PPO architecture

Tuned examples: Unity3D Soccer (multi-agent: Strikers vs Goalie), Humanoid-v1, Hopper-v1, Pendulum-v1, PongDeterministic-v4, Walker2d-v1, HalfCheetah-v2, {BeamRider,Breakout,Qbert,SpaceInvaders}NoFrameskip-v4

Atari results: more details

Atari env RLlib PPO @10M RLlib PPO @25M Baselines PPO @10M
BeamRider 2807 4480 ~1800
Breakout 104 201 ~250
Qbert 11085 14247 ~14000
SpaceInvaders 671 944 ~800

Scalability: more details

MuJoCo env RLlib PPO 16-workers @ 1h Fan et al PPO 16-workers @ 1h
HalfCheetah 9664 ~7700

RLlib's multi-GPU PPO scales to multiple GPUs and hundreds of CPUs on solving the Humanoid-v1 task. Here we compare against a reference MPI-based implementation.

PPO-specific configs (see also common configs):

ray.rllib.algorithms.ppo.ppo.PPOConfig
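A hedged sketch of a PPO config that uses several SGD passes per train batch and, optionally, multiple GPUs; all values are illustrative and worker/resource method names may vary slightly between Ray versions:

```python
# Sketch: PPO with multiple SGD passes per batch and optional GPU training.
from ray.rllib.algorithms.ppo import PPOConfig

config = (
    PPOConfig()
    .environment("CartPole-v1")
    .training(
        train_batch_size=4000,
        sgd_minibatch_size=128,
        num_sgd_iter=30,      # SGD passes over each train batch (enabled by the clipped objective)
        clip_param=0.2,
    )
    .rollouts(num_rollout_workers=4)   # parallel experience collection
    .resources(num_gpus=1)             # > 1 for multi-GPU SGD, 0 for CPU-only
)

algo = config.build()
results = algo.train()
```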

Importance Weighted Actor-Learner Architecture (IMPALA)

pytorch tensorflow [paper] [implementation] In IMPALA, a central learner runs SGD in a tight loop while asynchronously pulling sample batches from many actor processes. RLlib's IMPALA implementation uses DeepMind's reference V-trace code. Note that we do not provide a deep residual network out of the box, but one can be plugged in as a custom model. Multiple learner GPUs and experience replay are also supported.

IMPALA architecture

Tuned examples: PongNoFrameskip-v4, vectorized configuration, multi-gpu configuration, {BeamRider,Breakout,Qbert,SpaceInvaders}NoFrameskip-v4

Atari results @10M steps: more details

Atari env RLlib IMPALA 32-workers Mnih et al A3C 16-workers
BeamRider 2071 ~3000
Breakout 385 ~150
Qbert 4068 ~1000
SpaceInvaders 719 ~600

Scalability:

Atari env RLlib IMPALA 32-workers @1 hour Mnih et al A3C 16-workers @1 hour
BeamRider 3181 ~1000
Breakout 538 ~10
Qbert 10850 ~500
SpaceInvaders 843 ~300

Multi-GPU IMPALA scales up to solve PongNoFrameskip-v4 in ~3 minutes using a pair of V100 GPUs and 128 CPU workers. The maximum training throughput reached is ~30k transitions per second (~120k environment frames per second).

IMPALA-specific configs (see also common configs):

ray.rllib.algorithms.impala.impala.ImpalaConfig
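A minimal sketch of an IMPALA config with a central learner and several sampling workers; values are illustrative and worker/resource method names may vary slightly between Ray versions:

```python
# Sketch: IMPALA with many actor processes feeding a (GPU-capable) central learner.
from ray.rllib.algorithms.impala import ImpalaConfig

config = (
    ImpalaConfig()
    .environment("CartPole-v1")
    .training(lr=0.0005, train_batch_size=500)
    .rollouts(num_rollout_workers=8)   # actor processes producing sample batches asynchronously
    .resources(num_gpus=1)             # increase for multiple learner GPUs
)

algo = config.build()
results = algo.train()
```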

Model-free Off-policy RL

Deep Q Networks (DQN, Rainbow, Parametric DQN)

pytorch tensorflow [paper] [implementation] DQN can be scaled by increasing the number of workers or using Ape-X. Memory usage is reduced by compressing samples in the replay buffer with LZ4. All of the DQN improvements evaluated in Rainbow are available, though not all are enabled by default. See also how to use parametric-actions in DQN.

DQN architecture

Tuned examples: PongDeterministic-v4, Rainbow configuration, {BeamRider,Breakout,Qbert,SpaceInvaders}NoFrameskip-v4, with Dueling and Double-Q, with Distributional DQN.

Tip

Consider using Ape-X for faster training with similar timestep efficiency.

Hint

For a complete Rainbow setup, make the following changes to the default DQN config: "n_step": [between 1 and 10], "noisy": True, "num_atoms": [more than 1], "v_min": -10.0, "v_max": 10.0 (set v_min and v_max according to your expected range of returns).
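Applied to a DQNConfig, the hint above might look as follows (all values are illustrative; v_min/v_max should match your environment's expected return range):

```python
# Sketch: Rainbow-style settings from the hint, set on a DQNConfig.
from ray.rllib.algorithms.dqn import DQNConfig

config = (
    DQNConfig()
    .environment("CartPole-v1")
    .training(
        n_step=3,          # multi-step returns, anywhere between 1 and 10
        noisy=True,        # NoisyNet exploration instead of epsilon-greedy
        num_atoms=51,      # > 1 switches on distributional DQN
        v_min=-10.0,
        v_max=10.0,
        double_q=True,     # keep double-Q enabled
        dueling=True,      # keep the dueling architecture enabled
    )
)

algo = config.build()
results = algo.train()
```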

Atari results @10M steps: more details

Atari env RLlib DQN RLlib Dueling DDQN RLlib Dist. DQN Hessel et al. DQN
BeamRider 2869 1910 4447 ~2000
Breakout 287 312 410 ~150
Qbert 3921 7968 15780 ~4000
SpaceInvaders 650 1001 1025 ~500

DQN-specific configs (see also common configs):

ray.rllib.algorithms.dqn.dqn.DQNConfig

Soft Actor Critic (SAC)

pytorch tensorflow [original paper], [follow up paper], [discrete actions paper] [implementation]

SAC architecture (same as DQN)

RLlib's soft actor-critic implementation is ported from the official SAC repo to better integrate with RLlib APIs. Note that SAC has two fields for configuring custom models: policy_model_config and q_model_config; the model field of the config is ignored.

Tuned examples (continuous actions): Pendulum-v1, HalfCheetah-v3. Tuned examples (discrete actions): CartPole-v1

MuJoCo results @3M steps: more details

MuJoCo env RLlib SAC Haarnoja et al SAC
HalfCheetah 13000 ~15000

SAC-specific configs (see also common configs):

ray.rllib.algorithms.sac.sac.SACConfig
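A minimal sketch using the two dedicated model-config fields mentioned above; the network sizes and other values are illustrative only:

```python
# Sketch: SAC with custom policy- and Q-network sizes.
# Note: the generic `model` field is ignored by SAC; use the two fields below instead.
from ray.rllib.algorithms.sac import SACConfig

config = (
    SACConfig()
    .environment("Pendulum-v1")
    .training(
        policy_model_config={"fcnet_hiddens": [256, 256]},
        q_model_config={"fcnet_hiddens": [256, 256]},
        train_batch_size=256,
    )
)

algo = config.build()
results = algo.train()
```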

Model-based RL

DreamerV3

tensorflow [paper] [implementation]

DreamerV3 trains a world model in supervised fashion using real environment interactions. The world model's objective is to correctly predict all aspects of the transition dynamics of the RL environment: the correct next observations, the received rewards, and a boolean episode-continuation flag. A recurrent state space model (RSSM) is used, and training alternates between updating the world model (from actual env data) and updating the critic and actor networks, both of which are trained on "dreamed" trajectories produced by the world model.

DreamerV3 can be used in all types of environments, including those with image- or vector-based observations, continuous or discrete actions, and sparse or dense reward functions.
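A hedged sketch of a DreamerV3 setup (tf only). The model_size and training_ratio options shown are assumptions for illustration; check DreamerV3Config in your Ray version for the exact settings available:

```python
# Sketch: DreamerV3 on a simple vector-observation task.
from ray.rllib.algorithms.dreamerv3 import DreamerV3Config

config = (
    DreamerV3Config()
    .environment("CartPole-v1")
    .training(
        model_size="XS",        # assumed size preset for world model, actor, and critic
        training_ratio=1024,    # assumed ratio of trained timesteps per sampled env timestep
    )
)

algo = config.build()
results = algo.train()
```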

Tuned examples: Atari 100k, Atari 200M, DeepMind Control Suite

Pong-v5 results (1, 2, and 4 GPUs):

Episode mean rewards for the Pong-v5 environment (with the "100k" setting, in which only 100k environment steps are allowed). Note that despite the stable sample efficiency (shown by the constant learning performance per env step), the wall time improves almost linearly as we go from 1 to 4 GPUs. Left: Episode reward over environment timesteps sampled. Right: Episode reward over wall-time.

Atari 100k results (1 vs 4 GPUs):

Episode mean rewards for various Atari 100k tasks on 1 vs 4 GPUs. Left: Episode reward over environment timesteps sampled. Right: Episode reward over wall-time.

DeepMind Control Suite (vision) results (1 vs 4 GPUs):

Episode mean rewards for various DeepMind Control Suite (vision) tasks on 1 vs 4 GPUs. Left: Episode reward over environment timesteps sampled. Right: Episode reward over wall-time.

Multi-agent

Parameter Sharing

[paper], [paper] and [instructions]. Parameter sharing refers to a class of methods that take a base single-agent method and use it to learn a single policy for all agents. This simple approach has been shown to achieve state-of-the-art performance in cooperative games and is usually how you should start when tackling a multi-agent problem.
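A minimal sketch of parameter sharing with PPO as the bootstrapped algorithm. MultiAgentCartPole is RLlib's example multi-agent env, and its import path as well as the number of agents used here are assumptions that may differ between Ray versions:

```python
# Sketch: one shared PPO policy controls every agent in a multi-agent env.
from ray.rllib.algorithms.ppo import PPOConfig
from ray.rllib.examples.env.multi_agent import MultiAgentCartPole  # path may vary by Ray version

config = (
    PPOConfig()
    .environment(MultiAgentCartPole, env_config={"num_agents": 4})
    .multi_agent(
        policies={"shared_policy"},
        # Every agent maps to the same policy ID -> parameters are shared.
        policy_mapping_fn=lambda agent_id, episode, worker=None, **kwargs: "shared_policy",
    )
)

algo = config.build()
results = algo.train()
```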

Tuned examples: PettingZoo, waterworld, rock-paper-scissors, multi-agent cartpole

Shared Critic Methods

[instructions] Shared critic methods are those in which all agents use a single, parameter-shared critic network (in some cases with access to more of the observation space than individual agents can see). Note that many specialized multi-agent algorithms, such as MADDPG, are mostly shared-critic forms of their single-agent counterparts (DDPG in the case of MADDPG).

Tuned examples: TwoStepGame

Fully Independent Learning

[instructions] Fully independent learning involves a collection of agents learning independently of each other via single-agent methods. This typically works, but can be less effective than dedicated multi-agent RL methods, since independent learners do not account for the non-stationarity of the multi-agent environment.
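A minimal sketch of fully independent learning, again with PPO as the bootstrapped algorithm. As above, MultiAgentCartPole, its import path, and the agent-to-policy naming scheme are assumptions for illustration:

```python
# Sketch: fully independent learning -- each agent gets its own, separately trained PPO policy.
from ray.rllib.algorithms.ppo import PPOConfig
from ray.rllib.examples.env.multi_agent import MultiAgentCartPole  # path may vary by Ray version

NUM_AGENTS = 2

config = (
    PPOConfig()
    .environment(MultiAgentCartPole, env_config={"num_agents": NUM_AGENTS})
    .multi_agent(
        policies={f"policy_{i}" for i in range(NUM_AGENTS)},
        # Map each agent ID (0, 1, ...) to its own policy -> no parameter sharing.
        policy_mapping_fn=lambda agent_id, episode, worker=None, **kwargs: f"policy_{agent_id}",
    )
)

algo = config.build()
results = algo.train()
```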

Tuned examples: waterworld, multiagent-cartpole