Pixel observation with recurrent SAC-Discrete #2

twni2016 · 2022-03-02T05:32:38Z

This PR is not intended to be merged; it serves as a showcase of support for pixel observations with a discrete action space, e.g. Atari games.

We use the delayed-catch environment, introduced by the IMPALA+SR paper (https://arxiv.org/abs/2102.12425), as a sanity check. The environment gives a reward only at the terminal step and requires long-term memory. It has an image observation of size 1x7x7, 3 discrete actions, and a horizon of ~runs*7. We use a simple image encoder for image observations in place of the MLP encoder used for vector observations.
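To make the scale of the task concrete, here is a small sketch that turns the figures above into numbers. The helper names are made up for illustration, and the horizon is only approximate, since the description gives it as ~runs*7.

```python
# Figures stated in this PR: 1x7x7 image, 3 discrete actions,
# horizon of roughly runs * 7. Helper names are hypothetical.
IMAGE_SHAPE = (1, 7, 7)
NUM_ACTIONS = 3
STEPS_PER_RUN = 7  # approximate, from the "~runs*7" figure


def approx_horizon(runs: int) -> int:
    """Approximate episode length for a given number of runs."""
    return runs * STEPS_PER_RUN


def flat_obs_size(shape=IMAGE_SHAPE) -> int:
    """Number of pixels in one observation."""
    size = 1
    for dim in shape:
        size *= dim
    return size


for runs in (5, 10, 20, 40):
    print(f"runs={runs:>2}  approx horizon={approx_horizon(runs)}")
print("pixels per observation:", flat_obs_size())
```

Since the only reward arrives at the terminal step, the agent has to carry information across the whole episode, which is a few hundred steps in the 40-run setting.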

We try delayed-catch with 5, 10, 20, and 40 runs. The more runs, the harder the problem. Below are the learning curves of 10, 20, 40 runs for IMPALA and IMPALA+SR (their Fig. 7b).

Our running command:

# We sweep over the following range
python3 policies/main.py --cfg configs/pomdp/catch/rnn.yml --noautomatic_entropy_tuning --entropy_alpha [0.1,0.01,0.001]

where we found that a fixed temperature works much better than auto-tuning it with a target entropy in this task. (It is still a bit strange why one works and the other does not; with auto-tuning, the actor gradient eventually becomes zero.)
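For readers unfamiliar with the two options, the sketch below writes out the per-state actor loss and temperature loss in the form commonly used for SAC-Discrete. It is an illustration under that assumption, not code from this repository. The function names, the example logits, and the Q-values are hypothetical, and it does not attempt to explain why auto-tuning fails here.

```python
import math


def log_softmax(logits):
    """Numerically stable log-probabilities of a categorical policy."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def policy_entropy(logits):
    """H(pi) = -sum_a pi(a) * log pi(a)."""
    log_p = log_softmax(logits)
    return -sum(math.exp(lp) * lp for lp in log_p)


def actor_loss(logits, q_values, alpha):
    """pi(s)^T [alpha * log pi(s) - Q(s)] for a single state."""
    log_p = log_softmax(logits)
    return sum(math.exp(lp) * (alpha * lp - q) for lp, q in zip(log_p, q_values))


def temperature_loss(logits, alpha, target_entropy):
    """pi(s)^T [-alpha * (log pi(s) + target)] = alpha * (H(pi) - target).

    Only used when the temperature is auto-tuned; with a fixed
    temperature, alpha is a constant and this loss is skipped.
    """
    return alpha * (policy_entropy(logits) - target_entropy)


# Hypothetical state with 3 actions, matching the action count of delayed-catch.
logits = [0.0, 0.0, 0.0]
q_values = [1.0, 2.0, 3.0]
for alpha in (0.1, 0.01, 0.001):  # the fixed values swept in this PR
    print(f"alpha={alpha}: actor loss={actor_loss(logits, q_values, alpha):.4f}")
```

With a fixed temperature, `alpha` only scales the entropy term inside the actor loss. With auto-tuning, `alpha` is additionally updated by gradient descent on `temperature_loss`, so it moves whenever the policy entropy differs from the target.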

Delayed-catch with 5 runs: solved with 100k samples

Delayed-catch with 10 runs: solved with 400k samples (vs 50M for IMPALA+SR)

Delayed-catch with 20 runs: solved with 700k samples (vs 100M for IMPALA+SR)

Delayed-catch with 40 runs: after hyperparameter tuning, solved with 2M samples (vs 200M for IMPALA+SR)
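As a rough comparison, the sample counts listed above can be turned into ratios. This sketch uses only the numbers reported in this comment. The criterion for "solved" is not defined here and may differ between the two methods, so the ratios are indicative only.

```python
# (samples in this PR, samples for IMPALA+SR), keyed by number of runs,
# copied from the figures listed above.
REPORTED = {
    10: (400_000, 50_000_000),
    20: (700_000, 100_000_000),
    40: (2_000_000, 200_000_000),
}


def sample_ratio(ours: int, baseline: int) -> float:
    """How many times more samples the baseline used."""
    return baseline / ours


for runs, (ours, baseline) in REPORTED.items():
    print(f"runs={runs}: about {sample_ratio(ours, baseline):.0f}x fewer samples")
```

On these figures, the gap is about two orders of magnitude in each of the three settings.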

Different fixed alpha values:

With alpha=0.1:

twni2016 · 2022-03-13T04:02:30Z

I can no longer reproduce the results with the same seed. I have confirmed that the discrepancy comes from the PyTorch side, not from NumPy or Gym.


twni2016 · 2022-05-31T22:23:16Z

Closing this PR, as it will be merged into main via #13.

twni2016 added 16 commits February 28, 2022 22:23

introduce recurrent sac-discrete

72d6462

add readme

2a07e8c

black format

1f7940d

fix potential bug in sac-discrete

385be91

discount 0.99 is important to sacd in cartpole; introduce lunarlander

43bb3bb

minor

27ae0fc

introduce pixel obs POMDP env as sanity check

06ded95

black

a35b22e

update config for catch-40

9eb2d79

fix error in env reward, make o and 0.25 default

1197e3c

MINOR

47ea711

support tuning entropy_alpha

d2b0f39

merge

89e195c

Merge branch 'master' into sac-discrete

5d9b3ff

Merge remote-tracking branch 'origin/sac-discrete' into pixel-obs

07e0976

fix error

1369994

twni2016 changed the title ~~Pixel observation with recurrent SAC-Discrete (a sanity check)~~ Pixel observation with recurrent SAC-Discrete Mar 9, 2022

twni2016 added 4 commits March 12, 2022 01:06

Merge remote-tracking branch 'origin/main' into pixel-obs

f568031

Merge remote-tracking branch 'origin/main' into pixel-obs

470f4eb

Merge remote-tracking branch 'origin/main' into pixel-obs

e20c1fb

fix minor bug

7dab076

twni2016 added 8 commits March 19, 2022 18:25

add key2door env

e9b05d6

Merge branch 'main' of https://github.com/twni2016/pomdp-baselines in…

e265f4a

…to pixel-obs

refactor the gym wrapper

fbd306c

runnable

8bb2ffb

fix metric bug and reformat

11dc70a

update env.yml

e0a7231

update plot script

5a427e4

minor

4b966a0

twni2016 added 4 commits April 21, 2022 01:04

add key2door low/high variance

946fcae

fix discrepancy in max_frames in keytodoor

d19252b

add eval scripts

d529d15

move to a separate dir

b8196e7

twni2016 closed this May 31, 2022

twni2016 deleted the pixel-obs branch June 1, 2022 00:23
