[RLlib] Replay Buffer API and Training Iteration Fn for DQN. #23420
Conversation
Very nice PR @ArturNiederfahrenhorst, just a few nits left to fix. The regression tests for DQN on Breakout, comparing execution_plan vs. training_iteration, look super solid so far.
Co-authored-by: Sven Mika <sven@anyscale.io>
Hey @ArturNiederfahrenhorst, great PR and the benchmarks look really cool! Let's get this merged, but pull from master first, due to some changes made via @smorad's PPO PR.
Looks good. Only one nit left.
"no_local_replay_buffer": True, | ||
"replay_buffer_config": { | ||
# For now we don't use the new ReplayBuffer API here | ||
"_enable_replay_buffer_api": False, |
I'd really love to use the new replay buffer with APEX. What's the blocker here? Do we just need to rewrite the training_iteration function?
As soon as this PR is done I'll chat with Avnish to make sure we are aligned on changes to Ape-X and then do this!
"prioritized_replay_alpha": 0.6, | ||
# Beta parameter for sampling from prioritized replay buffer. | ||
"prioritized_replay_beta": 0.4, | ||
# Epsilon to add to the TD errors when updating priorities. | ||
"prioritized_replay_eps": 1e-6, | ||
# The number of continuous environment steps to replay at once. This may | ||
# be set to greater than 1 to support recurrent models. | ||
"replay_sequence_length": 1, |
Am I correct in that this is analogous to max_sequence_length from policy-gradient based policies? It might make sense to have a full_episode setting for variable-length episodes. The batch can be right zero-padded to the longest episode length in the train batch.
Yes, the replayed sequences are often shorter than replay_sequence_length.
What you are describing can be accomplished by setting storage_unit="episodes" or storage_unit=StorageUnit.EPISODES. Padding of the batch is so far not handled by the buffers, which is open for discussion! I think it should not be handled by the buffers, especially so that buffers can be reinstantiated from checkpoints in any setting.
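For illustration, a minimal sketch of the episode-storage setting mentioned above; the key names follow the replay_buffer_config pattern used in this PR, but exact keys and defaults may differ by Ray version:

# Hedged sketch: store whole (variable-length) episodes in the buffer
# instead of fixed-length sequences. Any padding/length handling is then
# left to the model/loss side, as discussed above.
config = {
    "replay_buffer_config": {
        "_enable_replay_buffer_api": True,
        "storage_unit": "episodes",  # or the StorageUnit.EPISODES enum value
        "capacity": 50_000,  # illustrative capacity
    },
}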
# One sampled item may span T timesteps; keep one buffer index per item
# so the indices line up with the per-item TD errors.
batch_indices = batch_indices.reshape([-1, T])[:, 0]
assert len(batch_indices) == len(td_error)
prio_dict[policy_id] = (batch_indices, td_error)
local_replay_buffer.update_priorities(prio_dict)
How does PER sample using recurrent policies? Do you assign a single priority to the entire replay_sequence, or do you sum up the priorities of a replay_sequence?
The entire sequence!
Today, priority is assigned per slot in the buffer. Depending on the replayed item's unit (timestep, sequence, episode) you choose, one priority applies to one item.
Do you think we should change this? So far this replicates what we have done in the past.
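A toy, self-contained sketch of the scheme described above (exactly one priority per stored item, regardless of whether an item is a timestep, a sequence, or an episode); this is illustrative only, not the RLlib implementation:

import numpy as np

class TinyPrioritizedBuffer:
    """Toy buffer: exactly one priority per stored item (slot)."""

    def __init__(self, alpha=0.6, eps=1e-6):
        self.items, self.priorities = [], []
        self.alpha, self.eps = alpha, eps

    def add(self, item):
        # New items get the current max priority so they are replayed at least once.
        self.items.append(item)
        self.priorities.append(max(self.priorities, default=1.0))

    def sample(self, k):
        probs = np.array(self.priorities) ** self.alpha
        probs = probs / probs.sum()
        idxs = np.random.choice(len(self.items), size=k, p=probs)
        return idxs, [self.items[i] for i in idxs]

    def update_priorities(self, idxs, td_errors):
        # One TD error updates exactly one slot, whatever unit the item represents.
        for i, td in zip(idxs, td_errors):
            self.priorities[i] = abs(td) + self.eps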
Yeah, we need to discuss restructuring our agents folder anyways. I'm actually in favor of separating every single algo, like APEX from DQN, and R2D2 from DQN (have them all as separate algos). APEX and R2D2 have no visibility right now, because they are buried inside DQN.
@@ -131,14 +127,14 @@ def validate_config(self, config: TrainerConfigDict) -> None:
     # Call super's validation method.
     super().validate_config(config)

-    if config["replay_sequence_length"] != -1:
+    if config["replay_buffer_config"]["replay_sequence_length"] != -1:
With your improved ReplayBuffer API I wonder if it makes sense to keep around R2D2 as a separate agent from DQN. AFAIK the only difference is the use of the TD error weighting function h and LSTM burn-in. If you plug these options into the DQN config, would you get distributed R2D2 for free via APEX?
Not worth doing in this PR, but might be worth doing after this is merged.
I have a very similar opinion.
I believe we should keep our algorithms section slimmer and provide R2D2 as an example script.
Same goes for RNNSAC.
"replay_buffer_config": { | ||
# Use the new ReplayBuffer API here | ||
"_enable_replay_buffer_api": True, | ||
# How many steps of the model to sample before learning starts. | ||
"learning_starts": 1000, |
What are the units here? Is a model step equivalent to a timestep?
I'll specify this!
Thanks for all your awesome comments!
Really cool stuff 💯
Just realized that this phrase is all over the library.
Moving DQN into the new training iteration API (from execution_plan). First benchmarks indicate at least equal performance on a Breakout Atari task:
Config:
Results:
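For context on what "training iteration API" means here, below is a rough, hedged sketch of the shape of a replay-buffer-based training step (illustrative pseudostructure only; the worker/buffer interfaces and result keys are assumptions, not the code merged in this PR):

def dqn_style_training_step(workers, replay_buffer, config, num_steps_added):
    # 1) Collect fresh experience from the rollout workers and add it to the
    #    (prioritized) replay buffer.
    new_batch = workers.sample()  # assumed: returns a SampleBatch-like object
    replay_buffer.add(new_batch)
    num_steps_added += new_batch.count  # assumed: env timesteps in the batch

    results = {}
    # 2) Only start learning once `learning_starts` timesteps are stored.
    if num_steps_added >= config["replay_buffer_config"]["learning_starts"]:
        # 3) Replay a train batch according to the stored priorities.
        train_batch = replay_buffer.sample(config["train_batch_size"])
        # 4) One SGD update on the local policies.
        results = workers.learn_on_batch(train_batch)  # assumed interface
        # 5) Feed the new TD errors back into the buffer (cf. the
        #    update_priorities() snippet earlier in this conversation).
        prio_dict = {
            pid: (info["batch_indices"], info["td_error"])  # assumed keys
            for pid, info in results.items()
        }
        replay_buffer.update_priorities(prio_dict)
        # 6) Target-network sync and weight broadcast happen periodically
        #    (omitted in this sketch).
    return results, num_steps_added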
Why are these changes needed?
Related issue number
Checks
I've run scripts/format.sh to lint the changes in this PR.