
Minor documentation issue with vanilla policy gradients #50

Closed
DanielTakeshi opened this issue Oct 22, 2016 · 2 comments

Comments

@DanielTakeshi

DanielTakeshi commented Oct 22, 2016

There seems to be a minor bias/variance typo. In the docs on vanilla policy gradients, it says:

When viewing the discount factor as a variance reduction factor for the undiscounted objective, this alternative gradient estimator has less bias, at the expense of having a larger variance

It seems like it should be the reverse: reducing variance but at the expense of larger bias. For instance in the OpenAI docs it says:

If the trajectories are very long (i.e., T is high), then the preceding formula will have excessive variance. Thus people generally use a discount factor, which reduces variance at the cost of some bias. The biased policy gradient formula is [...]

Though to be honest, I have very little intuition for how to tell which estimators have lower variance. I interpret the smaller variance, compared to the undiscounted objective, as coming from the way the discounted version shrinks the advantage values (where "advantage" means anything that gets multiplied by the grad-log probability of the policy). Intuitively, we would want advantage values that are smaller in magnitude...
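For concreteness, here is the comparison I have in mind, written out (return-to-go form assumed on my part, not copied from the rllab docs): every term after step t gets shrunk by a factor gamma^(t'-t) <= 1, so the discounted weight is smaller in magnitude (at least for nonnegative rewards) and, for noisy rewards, has lower variance.

```latex
% Weight multiplying grad log pi(a_t | s_t) at time step t
% (return-to-go form; notation assumed, not taken from the docs).
\underbrace{\sum_{t'=t}^{T-1} r_{t'}}_{\text{undiscounted}}
\qquad \text{vs.} \qquad
\underbrace{\sum_{t'=t}^{T-1} \gamma^{\,t'-t}\, r_{t'}}_{\text{discounted}},
\qquad 0 < \gamma < 1.
```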

The other thing that may not be totally clear is why the gradient in the vanilla policy gradient docs has that extra 1/T term, since we want the expectation of a sum over T terms, right? The 1/N is understandable: we have N trajectories, so we average over them. I guess the 1/T gets absorbed into the constant step size when doing gradient updates?

@dementrock
Member

dementrock commented Oct 22, 2016

Having a discount factor by itself reduces variance while introducing bias. However, when using a discount factor, there are two ways to compute the advantage for a time step t: you can start counting the discount at gamma^t, or start counting it at gamma^0. The latter will have larger variance, but it is less biased when the true objective does not contain a discount.
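A minimal numerical sketch of the two conventions (NumPy; the function names and the constant-reward trajectory are just for illustration, not anything from rllab):

```python
import numpy as np

def returns_gamma_t(rewards, gamma):
    # Discount counted from the start of the trajectory: the weight on the
    # action at time step t is sum_{t' >= t} gamma**t' * r_{t'}.  The extra
    # gamma**t factor shrinks late time steps (lower variance, more bias
    # with respect to the undiscounted objective).
    T = len(rewards)
    return np.array([sum(gamma ** tp * rewards[tp] for tp in range(t, T))
                     for t in range(T)])

def returns_gamma_0(rewards, gamma):
    # Discount restarted at each time step: the weight on the action at time
    # step t is sum_{t' >= t} gamma**(t' - t) * r_{t'}.  No gamma**t shrinkage,
    # so larger variance but less bias with respect to the undiscounted
    # objective.
    T = len(rewards)
    return np.array([sum(gamma ** (tp - t) * rewards[tp] for tp in range(t, T))
                     for t in range(T)])

rewards = np.ones(200)
print(returns_gamma_t(rewards, 0.99)[[0, 100, 199]])  # decays with t (extra gamma**t factor)
print(returns_gamma_0(rewards, 0.99)[[0, 100, 199]])  # only shrinks as the remaining horizon does
```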

The extra 1/T term is purely notational, since it's assumed that each trajectory has up to T time steps, and there are N trajectories in total.
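A toy illustration of that point (the shapes and random numbers are placeholders): the 1/(N*T) factor only rescales the estimate, so it can be folded into the learning rate.

```python
import numpy as np

# Stand-ins for the per-(trajectory, time step) terms
#   grad log pi(a_{i,t} | s_{i,t}) * advantage_{i,t},
# with N trajectories of T steps and a 5-dimensional parameter vector.
N, T, dim = 10, 100, 5
terms = np.random.randn(N, T, dim)

grad_sum = terms.sum(axis=(0, 1))    # plain sum over all N*T terms
grad_mean = terms.mean(axis=(0, 1))  # the 1/(N*T)-normalized estimator

# Same direction, different scale: the 1/(N*T) factor can be absorbed into
# the step size used for the gradient update.
assert np.allclose(grad_mean * N * T, grad_sum)
```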

@DanielTakeshi
Author

OK, I think I got confused about what you were referring to. I think the conclusions are now:

- No discount factor: no bias, but high variance.
- Discount factor, starting at gamma^t: some bias, lower variance.
- Discount factor, starting at gamma^0: some bias (but less than with gamma^t), somewhat higher variance (but still less than with no discount factor).

Though this is probably going to be dependent on the true objective, as you mentioned.
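A toy check of that ordering (i.i.d. Gaussian rewards; the horizon, gamma, and time step are arbitrary choices of mine, not from the docs):

```python
import numpy as np

rng = np.random.default_rng(0)
gamma, T, t, n_samples = 0.99, 200, 100, 5000

# Noisy rewards stand in for the stochasticity of real trajectories.
rewards = rng.normal(loc=1.0, scale=1.0, size=(n_samples, T))
tail = rewards[:, t:]  # rewards from time step t onward

ret_no_discount = tail.sum(axis=1)                            # no discount factor
ret_gamma_0 = (tail * gamma ** np.arange(T - t)).sum(axis=1)  # discount restarted at step t
ret_gamma_t = gamma ** t * ret_gamma_0                        # discount counted from step 0

print("variance, no discount:  ", ret_no_discount.var())  # largest
print("variance, gamma^0 start:", ret_gamma_0.var())       # smaller
print("variance, gamma^t start:", ret_gamma_t.var())       # smallest
```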
