There seems to be a minor bias/variance typo. In the docs on vanilla policy gradients, it says:
When viewing the discount factor as a variance reduction factor for the undiscounted objective, this alternative gradient estimator has less bias, at the expense of having a larger variance.
It seems like it should be the reverse: reducing variance but at the expense of larger bias. For instance in the OpenAI docs it says:
If the trajectories are very long (i.e., T is high), then the preceding formula will have excessive variance. Thus people generally use a discount factor, which reduces variance at the cost of some bias. The biased policy gradient formula is [...]
Though to be honest, I have very little intuition for how to tell which estimators have lower variance. I interpret the smaller variance compared to the undiscounted objective as coming from how the discounted version shrinks the advantage values (where "advantage" is taken to mean anything that gets multiplied with the grad-log probability of the policy). Intuitively, we would want advantage values that are smaller in magnitude...
The other thing that may not be totally clear is why the gradient in the vanilla policy gradient has that extra 1/T term, since we want the expectation over the sum of T terms, right? The 1/N is understandable because we have N trajectories and take the average over them. I guess the 1/T just gets absorbed into the constant (e.g., the learning rate) when doing gradient updates?
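For concreteness, the estimator I have in mind (my own notation, reconstructed from how these estimators are usually written, not a quote from the docs; N trajectories of up to T steps, with some advantage estimate multiplying the grad-log probability) is roughly:

```latex
\hat{g} \;=\; \frac{1}{N T} \sum_{i=1}^{N} \sum_{t=0}^{T-1}
  \nabla_\theta \log \pi_\theta\!\left(a^i_t \mid s^i_t\right) \hat{A}^i_t
```

so as far as I can tell the 1/T is just a constant sitting next to the 1/N.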
Having a discount factor by itself reduces variance while introducing bias. However, when using a discount factor and computing the advantage for a time step t, you can either start counting the discount at gamma^t or start counting it at gamma^0. The latter will have larger variance, but it's less biased when the true objective does not contain a discount.
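To make the two conventions concrete, here is a toy sketch (my own code, not from the repo; the function name is made up) of the per-timestep reward-to-go terms under each convention for a single trajectory:

```python
import numpy as np

def discounted_weights(rewards, gamma, count_from_time_zero):
    """Per-timestep reward-to-go terms for one trajectory.

    count_from_time_zero=True  -> "gamma^t" convention: the reward-to-go at step t
                                  gets an extra gamma^t factor (discount counted from
                                  the start of the trajectory).
    count_from_time_zero=False -> "gamma^0" convention: the usual reward-to-go
                                  sum_{t' >= t} gamma^(t' - t) * r_{t'}.
    """
    T = len(rewards)
    to_go = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):          # backward pass: G_t = r_t + gamma * G_{t+1}
        running = rewards[t] + gamma * running
        to_go[t] = running
    if count_from_time_zero:
        to_go = to_go * gamma ** np.arange(T)
    return to_go

rewards = [1.0, 0.0, 2.0, 1.0]
print(discounted_weights(rewards, 0.99, count_from_time_zero=True))   # gamma^t convention
print(discounted_weights(rewards, 0.99, count_from_time_zero=False))  # gamma^0 convention
```

Either variant would then multiply the grad-log probability at step t.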
The extra 1/T term is purely notational, since it's assumed that each trajectory has up to T time steps, and there are N trajectories in total.
OK, I think I got confused about what you were referring to. I think the conclusions are now:
- No discount factor = no bias, but high variance
- Discount factor, starting at gamma^t = some bias, lower variance
- Discount factor, starting at gamma^0 = some bias (but less than if we had gamma^t), somewhat higher variance (but less variance than no discount factor).
Though this is probably going to be dependent on the true objective, as you mentioned.
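As a crude sanity check on the variance ordering, here is a toy script (my own code; it uses i.i.d. unit-variance rewards as a stand-in for the noisy part of each trajectory, ignores the grad-log-probability terms entirely, and says nothing about bias) that compares the variance of the summed per-timestep weights under the three schemes:

```python
import numpy as np

rng = np.random.default_rng(0)
gamma, T, n_trajs = 0.99, 50, 20000

# i.i.d. unit-variance rewards stand in for the noisy part of each trajectory
rewards = rng.normal(size=(n_trajs, T))

def summed_weights(rewards, discount, scale_by_gamma_t):
    """Sum over t of the per-timestep reward-to-go weights, one value per trajectory."""
    n, T = rewards.shape
    to_go = np.zeros_like(rewards)
    running = np.zeros(n)
    for t in reversed(range(T)):          # backward pass over time steps
        running = rewards[:, t] + discount * running
        to_go[:, t] = running
    if scale_by_gamma_t:                  # "gamma^t" convention: extra gamma^t factor
        to_go = to_go * discount ** np.arange(T)
    return to_go.sum(axis=1)

no_discount = summed_weights(rewards, 1.0, False)
gamma_zero  = summed_weights(rewards, gamma, False)   # discount starting at gamma^0
gamma_t     = summed_weights(rewards, gamma, True)    # discount starting at gamma^t
print(np.var(no_discount), np.var(gamma_zero), np.var(gamma_t))
```

With these settings the empirical variances come out ordered as no discount > gamma^0 convention > gamma^t convention, which matches the list above (though again, this toy says nothing about bias).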