theoretical question #5
Comments
Also, I ran the DAAC model on Atari games, and it didn't do as well as PPO. Could this just be a hyperparameter problem, or something a bit deeper?
DAAC is designed to fix a problem that arises in procedurally generated environments, where you train on a relatively small number of instances (compared to the test distribution) and want to generalize to new ones (in particular, when learning from high-dimensional inputs like images, and when the different instances have different episode lengths, although this is not strictly necessary). This isn't a problem when you train and test on the same environment, as is the case in Atari, which has no notion of generalization or of training on different task instances (each Atari game represents a single environment). It also wouldn't be as much of a problem if you trained on all the environments you want to test on, even if they are procedurally generated. The problem we describe in the paper, which leads to overfitting when training on a subset of the environments and generalizing to new ones from the same distribution, remains irrespective of how large the network is.

These two components are orthogonal, so I doubt that doubling the PPO network would get the same performance, although it might do better than the original PPO. But then your baseline changes, so I would expect doubling the DAAC networks to help as well. In any case, the main idea of our work is to decouple the training of the policy and value networks (which particularly helps generalization, but also sample efficiency), rather than to simply increase capacity.
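The decoupling point above can be sketched in a few lines. This is a minimal illustration of the idea, not the paper's actual DAAC implementation: the policy and value function keep entirely separate parameter sets, so value-regression gradients never touch the policy's representation. The linear "networks", shapes, and learning rates here are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
obs_dim, n_actions, batch = 8, 4, 32

theta = rng.normal(scale=0.01, size=(obs_dim, n_actions))  # policy parameters
phi = rng.normal(scale=0.01, size=(obs_dim, 1))            # value parameters

def softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

obs = rng.normal(size=(batch, obs_dim))
actions = rng.integers(n_actions, size=batch)
returns = rng.normal(size=(batch, 1))

# Value step: plain regression toward the returns; only phi changes.
values = obs @ phi
phi -= 1e-2 * obs.T @ (values - returns) / batch

# Policy step: advantage-weighted log-prob gradient; only theta changes,
# so value-fitting noise cannot distort the policy features.
advantages = returns - (obs @ phi)  # value estimate used as a fixed baseline
probs = softmax(obs @ theta)
onehot = np.eye(n_actions)[actions]
theta -= 1e-2 * obs.T @ ((probs - onehot) * advantages) / batch
```

With shared parameters (a single trunk feeding both heads, as in standard PPO), both losses would update the same weights, which is exactly the interference the decoupling avoids.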
I see, thank you for the explanation. However, I was under the impression that it would at least match PPO's performance on a game like Breakout. What could be a possible explanation for this? Does having a bigger CNN architecture actually hurt the learning process?
Did you do a hyperparameter (HP) search? The optimal HPs for Atari might differ from those for Procgen, and it's typical to tune HPs for each benchmark. I don't think the bigger CNN is the main problem. However, it could be that gradients from the value provide more signal than those from the advantage (which are even noisier, since they depend on the action and not only the state). Thus, it is possible that PPO does better on such singleton environments, although I would expect only a small difference from PPO if properly tuned.
No, I have not; I thought the default values were a good starting point for Breakout. Also, about frame stacking: grayscale observations are [num_process, framestack, 84, 84]. How would I handle RGB frame stacking, i.e. [num_process, framestack, 3, 84, 84]? My guess is to stack the frames along the channel axis so the shape becomes [num_process, framestack*3, 84, 84].
Yes that sounds right. |
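For concreteness, the reshape discussed above can be done in one call. The sizes below (16 parallel envs, a stack of 4 RGB frames) are illustrative assumptions, not values from the repository:

```python
import numpy as np

# Hypothetical sizes: 16 parallel envs, 4 stacked 84x84 RGB frames.
num_process, framestack = 16, 4
rgb_stack = np.zeros((num_process, framestack, 3, 84, 84), dtype=np.uint8)

# Merge the frame and color axes into a single channel axis so the
# tensor matches the [num_process, channels, 84, 84] layout a CNN expects.
flat = rgb_stack.reshape(num_process, framestack * 3, 84, 84)
print(flat.shape)  # (16, 12, 84, 84)
```

Because the frame and color axes are adjacent, a plain reshape preserves the per-frame channel grouping; no transpose is needed.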
Do you think the improvement observed here has more to do with the fact that DAAC is essentially two IMPALA networks compared to one, i.e. that bigger networks simply tend to do better?
thank you