Problem of auto alpha in SAC #258
Interesting. I see they've discussed two problems:
The first problem is clear: optimizing _log_alpha prevents alpha from becoming negative, which would otherwise cause a NaN error. But the second problem is not well discussed, in my view. Someone argues that the only effective information is the sign of the gradient, but my point is that the gradient of exp(_log_alpha) can prevent alpha from dropping too fast when it is close to zero. Still, this needs further experiments to be validated.
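To make that concrete, here is a minimal sketch (not tianshou's actual code) comparing the two parameterizations; the constant c stands in for the detached (log_prob + target_entropy) term and is assumed negative, i.e. the entropy is already above target, so alpha should decrease:

```python
import torch

# Hypothetical detached value of (log_prob + target_entropy); negative here,
# meaning the policy's entropy already exceeds the target, so alpha should shrink.
c = torch.tensor(-2.0)

log_alpha = torch.tensor(-4.0, requires_grad=True)  # alpha = exp(-4) ~ 0.018, already near zero

# Variant A: loss written directly in terms of log_alpha.
loss_a = -(log_alpha * c)
loss_a.backward()
grad_a = log_alpha.grad.clone()  # = -c = 2.0: constant, so log_alpha keeps falling at a fixed rate
log_alpha.grad = None

# Variant B: loss written in terms of alpha = exp(log_alpha).
loss_b = -(torch.exp(log_alpha) * c)
loss_b.backward()
grad_b = log_alpha.grad.clone()  # = -alpha * c ~ 0.037: shrinks with alpha, so alpha stops crashing

print(grad_a, grad_b)
```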
I don't understand why we need to maintain an optimizer for alpha. Since log_prob has been detached, isn't alpha simply determined by the sign of (log_prob + target_entropy)?
Another question: why does log_prob need to be detached?
Check the paper here. The original objective for alpha is a dual problem, which is optimized by approximate dual gradient descent. This is done by alternating between optimizing the policy with respect to the current alpha and taking a gradient step on alpha. So alpha is not simply determined by the sign of (log_prob + target_entropy).
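For reference, the temperature objective in that paper is, roughly (a LaTeX transcription, with the bar-H symbol denoting the target entropy):

```latex
J(\alpha) = \mathbb{E}_{a_t \sim \pi_t}\!\left[ -\alpha \log \pi_t(a_t \mid s_t) - \alpha \bar{\mathcal{H}} \right]
```

A gradient step is taken on alpha while the current policy pi_t is held fixed, and then the policy is updated with the new alpha. Because pi_t changes at every iteration, the minimizer of J(alpha) changes too, which is why alpha is tracked with gradient steps instead of being set directly from the sign of (log_prob + target_entropy).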
Thanks for your reply. I still don't understand why alpha is NOT determined by the sign of (log_prob + target_entropy). Since the optimization is done by alternately updating alpha and the policy, it seems to me that when updating alpha, the term (log_prob + target_entropy) can be regarded as a constant. (Please correct me if I'm wrong.)
Hi, it seems like you may not be familiar with constrained optimization or the method of Lagrange multipliers. It's hard to explain them in a few words. A simple but possibly inaccurate explanation is that the optimal alpha is expressed in terms of the optimal policy, and since the policy is not optimal during gradient descent, you cannot set alpha to some value directly.
I think we should use alpha instead of log_alpha for 2 reasons:
It looks like they use alpha and not log_alpha in the official SAC repo as well: https://github.com/rail-berkeley/softlearning/blob/master/softlearning/algorithms/sac.py#L256
https://github.com/thu-ml/tianshou/tree/master/examples/mujoco#sac |
Dear author,
I find there is an inconsistency between your implementation and the algorithm described in the original paper https://arxiv.org/abs/1812.05905. The optimization objective for the temperature alpha should be
alpha_loss = -(torch.exp(self._log_alpha) * log_prob).mean()
instead of the current objective at tianshou/tianshou/policy/modelfree/sac.py, line 189 in cd48142.
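In case it helps, here is a minimal self-contained sketch of what the proposed update could look like (not the actual tianshou code; the batch of log-probabilities, the target entropy, and the learning rate are made up for illustration):

```python
import torch

target_entropy = -6.0                             # e.g. -action_dim for a 6-D action space
log_alpha = torch.zeros(1, requires_grad=True)    # alpha starts at exp(0) = 1
alpha_optim = torch.optim.Adam([log_alpha], lr=3e-4)

# Fake batch of log pi(a|s); in practice it comes from the current policy and is
# detached so that only alpha receives a gradient in this step.
log_prob = (torch.randn(256, 1) - 6.0).detach()

# Proposed objective: the gradient w.r.t. log_alpha scales with exp(log_alpha).
alpha_loss = -(torch.exp(log_alpha) * (log_prob + target_entropy)).mean()
alpha_optim.zero_grad()
alpha_loss.backward()
alpha_optim.step()

alpha = torch.exp(log_alpha).detach()             # value used to weight the entropy term elsewhere
```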