
theoretical question #5

Closed · hlsfin opened this issue Feb 22, 2022 · 7 comments

hlsfin commented Feb 22, 2022

Do you think the improvement noticed here has more to do with the fact that this is essentially two IMPALA networks compared to one, i.e. that a bigger network just tends to do better?

Thank you.

hlsfin commented Feb 25, 2022

Also, I ran the DAAC model on Atari games, and it didn't do as well as PPO; could this just be a hyperparameter problem, or something a bit deeper?
Thank you

@rraileanu

DAAC is meant to address a problem that comes up in procedurally generated environments, where you train on a relatively small number of instances (compared to the test distribution) and want to generalize to new ones. This is particularly relevant when learning from high-dimensional inputs like images, and when the different instances have different episode lengths (although that is not strictly necessary).

This isn't a problem when you train and test on the same environment, as is the case in Atari, which doesn't have a notion of generalization or training on different task instances (each Atari game represents a single environment). It also wouldn't be as much of a problem if you train on all the environments you want to test on, even if they are procedurally generated.

The problem we describe in the paper, which leads to overfitting when you train on a subset of the environments and want to generalize to new ones from the same distribution, remains irrespective of how large the network is. These two factors are orthogonal, so I doubt that doubling the PPO network would reach the same performance, although it might do better than the original PPO. But then your baseline changes, so I would expect that doubling the DAAC networks would also help. In any case, the main idea of our work is to decouple the training of the policy and value networks (which particularly helps generalization, but also sample efficiency), rather than simply increase capacity.
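
To make the decoupling concrete, here is a minimal sketch (not the actual code in this repo; the encoder and layer sizes are placeholders): the policy network carries an extra advantage head conditioned on the action, while the value function is trained in a completely separate network.

```python
# Minimal sketch of the decoupling idea, not this repo's actual implementation:
# the policy (with its advantage head) and the value function use separate
# networks, so value-gradient noise cannot shape the policy's representation.
import torch
import torch.nn as nn

def make_encoder(out_dim=256):
    # Placeholder encoder; the paper/repo use an IMPALA-style CNN instead.
    return nn.Sequential(nn.Flatten(), nn.LazyLinear(out_dim), nn.ReLU())

class PolicyAdvantageNet(nn.Module):
    def __init__(self, num_actions, hidden=256):
        super().__init__()
        self.encoder = make_encoder(hidden)
        self.policy_head = nn.Linear(hidden, num_actions)
        # The advantage head is conditioned on the taken action
        # (here via a one-hot concatenation, as an illustration).
        self.advantage_head = nn.Linear(hidden + num_actions, 1)

    def forward(self, obs, action_onehot):
        h = self.encoder(obs)
        logits = self.policy_head(h)
        adv = self.advantage_head(torch.cat([h, action_onehot], dim=-1))
        return logits, adv

class ValueNet(nn.Module):
    def __init__(self, hidden=256):
        super().__init__()
        self.encoder = make_encoder(hidden)  # a *separate* encoder
        self.value_head = nn.Linear(hidden, 1)

    def forward(self, obs):
        return self.value_head(self.encoder(obs))
```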

hlsfin commented Feb 25, 2022

I see, thank you for your explanation. However, I was under the impression that it would at least match PPO's performance on Breakout. What could be a possible explanation for this? Does having a bigger CNN architecture actually hurt the learning process?

rraileanu commented Feb 25, 2022

Did you do a hyperparameter (HP) search? The optimal HPs for Atari might be different than for Procgen, and it's typical to tune HPs for each benchmark.

I don't think the bigger CNN is the main problem. However, it could be that gradients from the value function provide more signal than those from the advantage (advantage targets are even noisier, since they depend on the action and not only the state). Thus, it is possible that PPO does better on such singleton environments, although I would expect only a small difference from PPO if properly tuned.
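
For reference, the advantage targets are standard GAE estimates; a rough sketch (illustrative, not taken from this repo's code) shows why they are tied to the sampled actions:

```python
import numpy as np

def gae_advantages(rewards, values, dones, gamma=0.99, lam=0.95):
    # values has length T + 1 (it includes the bootstrap value for the final state).
    # The resulting A_t estimates how good the sampled action a_t was relative to
    # V(s_t), so it varies with the action taken, unlike the value target.
    T = len(rewards)
    adv = np.zeros(T, dtype=np.float32)
    last = 0.0
    for t in reversed(range(T)):
        nonterminal = 1.0 - float(dones[t])
        delta = rewards[t] + gamma * values[t + 1] * nonterminal - values[t]
        last = delta + gamma * lam * nonterminal * last
        adv[t] = last
    return adv  # the value head's regression targets would be adv + values[:-1]
```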

hlsfin commented Feb 25, 2022

No, I have not; I thought the default values given were a good starting point for Breakout. Also, about frame stacking: grayscale observations come out as [num_process, framestack, 84, 84]; how would I handle RGB frame stacking, i.e. [num_process, framestack, 3, 84, 84]? My guess is to stack them along the channel dimension so that it becomes [num_process, framestack * 3, 84, 84], as in the sketch below.
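
Concretely, I am thinking of something like this (shapes and values are just examples):

```python
import torch

# Illustrative only: fold the RGB channel into the stack dimension so the
# CNN sees a 4-D batch instead of a 5-D tensor.
obs = torch.randn(8, 4, 3, 84, 84)         # [num_process, framestack, 3, 84, 84]
n, k, c, h, w = obs.shape
obs_stacked = obs.reshape(n, k * c, h, w)  # [num_process, framestack * 3, 84, 84]
print(obs_stacked.shape)                   # torch.Size([8, 12, 84, 84])
```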

hlsfin commented Feb 25, 2022

[image attachment]
Here is what I got so far.

@rraileanu

Yes, that sounds right.

hlsfin closed this as completed Feb 26, 2022