Pixel observation with recurrent SAC-Discrete #2

Closed (pull request with 32 commits)

@twni2016 (Owner) commented Mar 2, 2022

This PR is not intended to be merged; it is a showcase of support for pixel observations with a discrete action space, e.g. Atari games.

As a sanity check, we use the delayed-catch environment introduced by IMPALA+SR (https://arxiv.org/abs/2102.12425). The environment gives only a terminal reward and requires long-term memory. It has an image observation of size 1x7x7, 3 discrete actions, and a horizon of ~runs*7. We replace the MLP encoder used for vector observations with a simple image encoder.
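To make the task structure concrete, here is a minimal pure-Python sketch of a delayed-catch-style environment. It is not the environment used in this PR or in the IMPALA+SR paper: the class name `DelayedCatchSketch`, the action encoding (0 = left, 1 = stay, 2 = right), and the choice of "number of catches" as the terminal reward are all assumptions made for illustration. In this sketch each run lasts 6 steps, so the horizon is runs*6, close to the ~runs*7 quoted above.

```python
import random


class DelayedCatchSketch:
    """Hypothetical stand-in for delayed-catch; not the repo's environment."""

    SIZE = 7          # grid is SIZE x SIZE, observation is 1 x SIZE x SIZE
    NUM_ACTIONS = 3   # assumed encoding: 0 = left, 1 = stay, 2 = right

    def __init__(self, runs, seed=None):
        self.runs = runs
        self.rng = random.Random(seed)

    def reset(self):
        self.runs_done = 0
        self.catches = 0
        self.paddle = self.SIZE // 2
        self._drop_ball()
        return self._obs()

    def _drop_ball(self):
        # A new ball appears in the top row at a random column.
        self.ball_row = 0
        self.ball_col = self.rng.randrange(self.SIZE)

    def _obs(self):
        grid = [[0.0] * self.SIZE for _ in range(self.SIZE)]
        grid[self.ball_row][self.ball_col] = 1.0
        grid[self.SIZE - 1][self.paddle] = 1.0
        return [grid]  # single channel: shape 1 x 7 x 7

    def step(self, action):
        # Move the paddle (clamped to the grid), then let the ball fall one row.
        self.paddle = min(self.SIZE - 1, max(0, self.paddle + action - 1))
        self.ball_row += 1
        reward, done = 0.0, False
        if self.ball_row == self.SIZE - 1:      # ball reached the paddle row
            self.catches += int(self.ball_col == self.paddle)
            self.runs_done += 1
            if self.runs_done == self.runs:
                # All reward is withheld until the final step of the episode.
                reward, done = float(self.catches), True
            else:
                self._drop_ball()
        return self._obs(), reward, done


def run_random_episode(env, seed=0):
    """Roll out a uniformly random policy; return (steps, final_reward)."""
    rng = random.Random(seed)
    env.reset()
    steps, reward, done = 0, 0.0, False
    while not done:
        _, reward, done = env.step(rng.randrange(env.NUM_ACTIONS))
        steps += 1
    return steps, reward
```

Because every intermediate reward is zero, the agent has to carry information about earlier catches across the whole episode, which is why the task is a test of long-term memory and credit assignment.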

We try delayed-catch with 5, 10, 20, and 40 runs; the more runs, the harder the problem. Below are the learning curves of IMPALA and IMPALA+SR for 10, 20, and 40 runs (their Fig. 7b).

Screen Shot 2022-03-02 at 1 40 29 PM

Our running command:

# We sweep over the following range
python3 policies/main.py --cfg configs/pomdp/catch/rnn.yml --noautomatic_entropy_tuning --entropy_alpha [0.1,0.01,0.001]

where we found that a fixed temperature works much better than auto-tuning it with a target entropy in this task. (It is still a bit strange why the former works but the latter does not; with auto-tuning, the actor gradient eventually becomes zero.)
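For readers less familiar with the temperature in SAC-Discrete, the following pure-Python sketch shows the two objectives being contrasted, for a single state with a categorical policy. It is not the code in this repo: the function names are hypothetical, the Q-values are made up, and the temperature loss is written in one common form (log_alpha times the gap between policy entropy and target entropy). It also uses the standard fact that, for a fixed alpha, the actor loss is minimized by softmax(Q / alpha).

```python
import math


def softmax(xs):
    m = max(xs)                                   # subtract max for stability
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]


def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def actor_loss(probs, q_values, alpha):
    """Exact expectation over actions of alpha * log pi(a) - Q(a)."""
    return sum(p * (alpha * math.log(p) - q)
               for p, q in zip(probs, q_values) if p > 0.0)


def alpha_loss(probs, log_alpha, target_entropy):
    """One common auto-tuning objective: log_alpha * (H(pi) - H_target).

    Its gradient w.r.t. log_alpha is H(pi) - H_target, so gradient descent
    lowers alpha while entropy is above target and raises it while below.
    """
    return log_alpha * (entropy(probs) - target_entropy)


def soft_optimal_policy(q_values, alpha):
    """Minimizer of actor_loss for a fixed alpha: softmax(Q / alpha)."""
    return softmax([q / alpha for q in q_values])


if __name__ == "__main__":
    q = [1.0, 0.5, 0.0]                           # made-up Q-values, 3 actions
    for alpha in (0.1, 0.01, 0.001):              # the fixed values swept above
        pi = soft_optimal_policy(q, alpha)
        print(f"alpha={alpha}: entropy={entropy(pi):.3g}, "
              f"actor_loss={actor_loss(pi, q, alpha):.4f}")
```

With a fixed alpha the entropy bonus has a constant weight, so a smaller alpha simply makes the soft-optimal policy greedier. With auto-tuning, alpha itself moves according to the entropy gap, which is the mechanism the remark above is questioning.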

  • Delayed-catch with 5 runs: solved with 100k samples

Screen Shot 2022-03-09 at 2 22 10 AM

  • Delayed-catch with 10 runs: solved with 400k samples (vs 50M for IMPALA+SR)

Screen Shot 2022-03-09 at 2 21 21 AM

  • Delayed-catch with 20 runs: solved with 700k samples (vs 100M for IMPALA+SR)

Screen Shot 2022-03-09 at 2 22 09 PM

  • Delayed-catch with 40 runs: after hyperparameter tuning, solved with 2M samples (vs 200M for IMPALA+SR)

Different fixed alpha values:
Screen Shot 2022-03-11 at 2 05 30 PM

With alpha=0.1:
Screen Shot 2022-03-13 at 5 07 47 PM
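As a quick piece of arithmetic on the numbers listed above, the snippet below computes the approximate horizon (runs*7) and the ratio of IMPALA+SR samples to our samples for each setting where both are quoted. The figures are copied from the bullets; they are single reported values, so the ratios are only rough indications.

```python
# (our samples, IMPALA+SR samples) per number of runs, as quoted above.
samples = {
    10: (400_000, 50_000_000),
    20: (700_000, 100_000_000),
    40: (2_000_000, 200_000_000),
}

# Approximate episode horizon, using the ~runs*7 rule of thumb.
horizon = {runs: runs * 7 for runs in (5, 10, 20, 40)}

# How many times fewer samples were needed compared to IMPALA+SR.
speedup = {runs: theirs / ours for runs, (ours, theirs) in samples.items()}

for runs, ratio in speedup.items():
    print(f"{runs} runs (horizon ~{horizon[runs]}): ~{ratio:.0f}x fewer samples")
# 10 runs (horizon ~70): ~125x fewer samples
# 20 runs (horizon ~140): ~143x fewer samples
# 40 runs (horizon ~280): ~100x fewer samples
```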

@twni2016 changed the title from "Pixel observation with recurrent SAC-Discrete (a sanity check)" to "Pixel observation with recurrent SAC-Discrete" on Mar 9, 2022
@twni2016 (Owner, Author) commented

I can no longer reproduce the results with the same seed. I have confirmed that the nondeterminism comes from the PyTorch side, not from NumPy or Gym.
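For reference, the sketch below collects the general-purpose seeding and determinism switches that PyTorch exposes. It is not a diagnosis of the issue described above and is not taken from this repo; the helper name `seed_everything` is hypothetical. The NumPy and PyTorch imports are guarded so that the sketch also runs where those libraries are missing.

```python
import os
import random


def seed_everything(seed, deterministic_torch=True):
    """Seed Python, NumPy, and PyTorch RNGs (hypothetical helper)."""
    random.seed(seed)

    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass

    try:
        import torch
    except ImportError:
        return

    torch.manual_seed(seed)  # seeds the CPU and all CUDA devices

    if deterministic_torch:
        # Needed by some cuBLAS routines for deterministic results on
        # CUDA >= 10.2; must be set before CUDA is initialized.
        os.environ.setdefault("CUBLAS_WORKSPACE_CONFIG", ":4096:8")
        # Disable cuDNN autotuning and request deterministic cuDNN kernels.
        torch.backends.cudnn.benchmark = False
        torch.backends.cudnn.deterministic = True
        # Ops without a deterministic implementation will now raise an error
        # when called, which helps locate the source of nondeterminism.
        torch.use_deterministic_algorithms(True)
```

Calling `seed_everything(seed)` once at the start of a run, before any model or environment is created, is the usual pattern. Even with these settings, results can still differ across hardware, driver, or PyTorch versions.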

@twni2016 (Owner, Author) commented

Closing this PR, as it will be merged into main via #13.

@twni2016 closed this May 31, 2022
@twni2016 deleted the pixel-obs branch June 1, 2022 00:23