There seems to be a minor bias/variance typo. In the docs on vanilla policy gradients, it says:
When viewing the discount factor as a variance reduction factor for the undiscounted objective, this alternative gradient estimator has less bias, at the expense of having a larger variance.
It seems like it should be the reverse: reducing variance but at the expense of larger bias. For instance in the OpenAI docs it says:
If the trajectories are very long (i.e., T is high), then the preceding formula will have excessive variance. Thus people generally use a discount factor, which reduces variance at the cost of some bias. The biased policy gradient formula is [...]
Though to be honest, I have very little intuition for how to tell which estimators have lower variance. I interpret the smaller variance compared to the undiscounted objective as coming from how the discounted version shrinks the advantage values (where "advantage" is taken to mean anything that gets multiplied with the grad-log probability of the policy). Intuitively, we would want advantage values that are smaller in magnitude...
The other thing that may not be totally clear is why the gradient in the vanilla policy gradient has that extra 1/T term, since we want the expectation over the sum of T terms, right? The 1/N is understandable because we have N trajectories and take the average over them. I guess the 1/T just gets absorbed into the constant (e.g., the learning rate) when doing gradient updates?
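For concreteness, the estimator I have in mind (my own notation, reconstructed from how these estimators are usually written, not a quote from the docs; N trajectories of up to T steps, with some advantage estimate multiplying the grad-log probability) is roughly:

```latex
\hat{g} \;=\; \frac{1}{N T} \sum_{i=1}^{N} \sum_{t=0}^{T-1}
  \nabla_\theta \log \pi_\theta\!\left(a^i_t \mid s^i_t\right) \hat{A}^i_t
```

so as far as I can tell the 1/T is just a constant sitting next to the 1/N.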
Having a discount factor by itself reduces variance while introducing bias. However, when using a discount factor and computing the advantage for a time step t, you can either start counting the discount at gamma^t or start counting it at gamma^0. The latter will have larger variance, but it's less biased when the true objective does not contain a discount.
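To make the two conventions concrete, here is a toy sketch (my own code, not from the repo; the function name is made up) of the per-timestep reward-to-go terms under each convention for a single trajectory:

```python
import numpy as np

def discounted_weights(rewards, gamma, count_from_time_zero):
    """Per-timestep reward-to-go terms for one trajectory.

    count_from_time_zero=True  -> "gamma^t" convention: the reward-to-go at step t
                                  gets an extra gamma^t factor (discount counted from
                                  the start of the trajectory).
    count_from_time_zero=False -> "gamma^0" convention: the usual reward-to-go
                                  sum_{t' >= t} gamma^(t' - t) * r_{t'}.
    """
    T = len(rewards)
    to_go = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):          # backward pass: G_t = r_t + gamma * G_{t+1}
        running = rewards[t] + gamma * running
        to_go[t] = running
    if count_from_time_zero:
        to_go = to_go * gamma ** np.arange(T)
    return to_go

rewards = [1.0, 0.0, 2.0, 1.0]
print(discounted_weights(rewards, 0.99, count_from_time_zero=True))   # gamma^t convention
print(discounted_weights(rewards, 0.99, count_from_time_zero=False))  # gamma^0 convention
```

Either variant would then multiply the grad-log probability at step t.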
The extra 1/T term is purely notational, since it's assumed that each trajectory has up to T time steps, and there are N trajectories in total.
OK, I think I got confused about what you were referring to. I think the conclusions are now:
- No discount factor = no bias, but high variance
- Discount factor, starting at gamma^t = some bias, lower variance
- Discount factor, starting at gamma^0 = some bias (but less than if we had gamma^t), somewhat higher variance (but less variance than no discount factor).
Though this is probably going to be dependent on the true objective, as you mentioned.
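As a crude sanity check on the variance ordering, here is a toy script (my own code; it uses i.i.d. unit-variance rewards as a stand-in for the noisy part of each trajectory, ignores the grad-log-probability terms entirely, and says nothing about bias) that compares the variance of the summed per-timestep weights under the three schemes:

```python
import numpy as np

rng = np.random.default_rng(0)
gamma, T, n_trajs = 0.99, 50, 20000

# i.i.d. unit-variance rewards stand in for the noisy part of each trajectory
rewards = rng.normal(size=(n_trajs, T))

def summed_weights(rewards, discount, scale_by_gamma_t):
    """Sum over t of the per-timestep reward-to-go weights, one value per trajectory."""
    n, T = rewards.shape
    to_go = np.zeros_like(rewards)
    running = np.zeros(n)
    for t in reversed(range(T)):          # backward pass over time steps
        running = rewards[:, t] + discount * running
        to_go[:, t] = running
    if scale_by_gamma_t:                  # "gamma^t" convention: extra gamma^t factor
        to_go = to_go * discount ** np.arange(T)
    return to_go.sum(axis=1)

no_discount = summed_weights(rewards, 1.0, False)
gamma_zero  = summed_weights(rewards, gamma, False)   # discount starting at gamma^0
gamma_t     = summed_weights(rewards, gamma, True)    # discount starting at gamma^t
print(np.var(no_discount), np.var(gamma_zero), np.var(gamma_t))
```

With these settings the empirical variances come out ordered as no discount > gamma^0 convention > gamma^t convention, which matches the list above (though again, this toy says nothing about bias).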