Problem of auto alpha in SAC #258
Interesting. I see they've discussed two problems:
The first problem is clear: optimizing _log_alpha prevents alpha from becoming negative, which would otherwise cause a NaN error. But the second problem is not well discussed, in my view. Someone argues that the only effective information is the sign of the gradient, but my point is that the gradient of exp(_log_alpha) can prevent alpha from dropping too fast when it is close to zero. Still, this needs further experiments to be validated.
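To make that concrete, here is a minimal sketch (not tianshou's actual code) comparing the two parameterizations; the constant c stands in for the detached (log_prob + target_entropy) term and is assumed negative, i.e. the entropy is already above target, so alpha should decrease:

```python
import torch

# Hypothetical detached value of (log_prob + target_entropy); negative here,
# meaning the policy's entropy already exceeds the target, so alpha should shrink.
c = torch.tensor(-2.0)

log_alpha = torch.tensor(-4.0, requires_grad=True)  # alpha = exp(-4) ~ 0.018, already near zero

# Variant A: loss written directly in terms of log_alpha.
loss_a = -(log_alpha * c)
loss_a.backward()
grad_a = log_alpha.grad.clone()  # = -c = 2.0: constant, so log_alpha keeps falling at a fixed rate
log_alpha.grad = None

# Variant B: loss written in terms of alpha = exp(log_alpha).
loss_b = -(torch.exp(log_alpha) * c)
loss_b.backward()
grad_b = log_alpha.grad.clone()  # = -alpha * c ~ 0.037: shrinks with alpha, so alpha stops crashing

print(grad_a, grad_b)
```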
I don't understand why we need to maintain an optimizer for alpha. Since log_prob has been detached, isn't alpha simply determined by the sign of (log_prob + target_entropy)?
Another question: why does log_prob need to be detached?
Check the paper here. The original objective for alpha is a dual problem, which is optimized by approximate dual gradient descent. This is done by alternating between optimizing the policy with respect to the current alpha and taking a gradient step on alpha. So alpha is not simply determined by the sign of (log_prob + target_entropy).
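For reference, the temperature objective in that paper is, roughly (a LaTeX transcription, with the bar-H symbol denoting the target entropy):

```latex
J(\alpha) = \mathbb{E}_{a_t \sim \pi_t}\!\left[ -\alpha \log \pi_t(a_t \mid s_t) - \alpha \bar{\mathcal{H}} \right]
```

A gradient step is taken on alpha while the current policy pi_t is held fixed, and then the policy is updated with the new alpha. Because pi_t changes at every iteration, the minimizer of J(alpha) changes too, which is why alpha is tracked with gradient steps instead of being set directly from the sign of (log_prob + target_entropy).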
Thanks for your reply. I still don't understand why alpha is NOT determined by the sign of (log_prob + target_entropy). Since the optimization is done by alternately updating alpha and the policy, it seems to me that when updating alpha, the term (log_prob + target_entropy) can be regarded as a constant. (Please correct me if I'm wrong.)
Hi, it seems like you may not be familiar with constrained optimization or the method of Lagrange multipliers. It's hard to explain them in a few words. A simple but possibly inaccurate explanation is that the optimal alpha is expressed in terms of the optimal policy, and since the policy is not optimal during gradient descent, you cannot set alpha to some value directly.
I think we should use alpha instead of log_alpha for 2 reasons:
It looks like they use alpha and not log_alpha in the official SAC repo as well: https://github.com/rail-berkeley/softlearning/blob/master/softlearning/algorithms/sac.py#L256
https://github.com/thu-ml/tianshou/tree/master/examples/mujoco#sac |
Dear author,
I find there is an inconsistency between your implementation and the algorithm described in the original paper https://arxiv.org/abs/1812.05905. The optimization objective for the temperature alpha should be
alpha_loss = -(torch.exp(self._log_alpha) * log_prob).mean()
instead of the current objective at tianshou/tianshou/policy/modelfree/sac.py, line 189 in cd48142.
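In case it helps, here is a minimal self-contained sketch of what the proposed update could look like (not the actual tianshou code; the batch of log-probabilities, the target entropy, and the learning rate are made up for illustration):

```python
import torch

target_entropy = -6.0                             # e.g. -action_dim for a 6-D action space
log_alpha = torch.zeros(1, requires_grad=True)    # alpha starts at exp(0) = 1
alpha_optim = torch.optim.Adam([log_alpha], lr=3e-4)

# Fake batch of log pi(a|s); in practice it comes from the current policy and is
# detached so that only alpha receives a gradient in this step.
log_prob = (torch.randn(256, 1) - 6.0).detach()

# Proposed objective: the gradient w.r.t. log_alpha scales with exp(log_alpha).
alpha_loss = -(torch.exp(log_alpha) * (log_prob + target_entropy)).mean()
alpha_optim.zero_grad()
alpha_loss.backward()
alpha_optim.step()

alpha = torch.exp(log_alpha).detach()             # value used to weight the entropy term elsewhere
```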