[Question] Any idea why SAC loss would diverge? #50
Comments
A bit hard to figure out based on just this. What did you set as the target entropy? An alpha of …
You are right, that is not an underflow... that is divergence (it's a +)...
Are you using discrete actions? That heuristic wouldn't work in that case.
It behaves like discrete, yes. I didn't change the policy to account for optimizing a softmax because I wasn't able to figure out how to derive the temperature-based sampling/exploration from the equations. Everybody says it is easy, but no one shows how to do it :D (e.g. openai/spinningup#22). So I hacked around it instead, making the environment interpret the continuous actions as discrete signals. It was bound to have some side effect, and now that you've noticed it is actually diverging, it makes sense. For a typical 3 states (…
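For what it's worth, "temperature-based sampling" over a discrete/softmax policy can be sketched in a few lines. This is a minimal illustration, not the repo's actual policy code; `softmax` and `sample_action` are hypothetical helper names:

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax: higher temperature -> closer to uniform,
    lower temperature -> concentrates on the argmax."""
    z = np.asarray(logits, dtype=np.float64) / temperature
    z -= z.max()  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def sample_action(logits, temperature=1.0, rng=None):
    """Sample a discrete action index from the temperature-scaled softmax."""
    rng = rng or np.random.default_rng()
    p = softmax(logits, temperature)
    return int(rng.choice(len(p), p=p))
```

As the temperature goes to 0 this approaches greedy (argmax) action selection; as it grows, exploration approaches uniform random.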
For discrete actions, you should choose a positive number that's less than log(# of actions). To compute the entropy, you should look up the definition of entropy. For discrete actions it's -Σ p log(p).
First time with an entropy-based algorithm, so I'm clueless on that... Any accessible write-up you know of would be great to read. Would you go with close to …
OK, now that I changed the target entropy to 0.35 I don't see the divergent behavior, but what I do see is a collapse in the deterministic policy results. The strange thing is that if I restart the process loading the last policy, I get behaviors similar to those found in the exploration phase of the epoch. That sounds like a bug in the evaluation part; could that be?
The evaluation code and exploration code are the same. They just use the …
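The usual pattern is that evaluation and exploration share one policy and differ only in a flag: evaluation takes the most likely (deterministic) action, exploration samples. A toy sketch of that interface, with names assumed for illustration (rlkit's actual policy classes differ in detail):

```python
import numpy as np

class SoftmaxPolicy:
    """Toy discrete policy: one network serves both exploration and
    evaluation, switched by a `deterministic` flag."""

    def __init__(self, logits_fn, rng=None):
        self.logits_fn = logits_fn          # maps observation -> action logits
        self.rng = rng or np.random.default_rng(0)

    def get_action(self, obs, deterministic=False):
        logits = np.asarray(self.logits_fn(obs), dtype=np.float64)
        if deterministic:                   # evaluation: most likely action
            return int(np.argmax(logits))
        z = logits - logits.max()           # exploration: sample from softmax
        p = np.exp(z) / np.exp(z).sum()
        return int(self.rng.choice(len(p), p=p))
```

If deterministic evaluation collapses to one action while sampled rollouts look fine, the policy distribution may have become sharply peaked, which is consistent with an entropy target set too low.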
I left it running for a few epochs, several times to ensure that it was not a fluke.
And SAC is collapsing to always choose the same action.
Running it from master. Could it be related to the action space being effectively discrete? The environment will discretize the actions into 'x' states based on the input data.
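The environment-side hack described above (continuous action in, discrete state out) amounts to binning. A minimal sketch, assuming a scalar action in a known range; the function name and signature are hypothetical:

```python
import numpy as np

def discretize_action(a, low, high, num_bins):
    """Map a continuous scalar action in [low, high] onto one of
    num_bins discrete states by uniform binning."""
    a = float(np.clip(a, low, high))
    # scale into [0, num_bins); min() keeps high itself in the last bin
    idx = int((a - low) / (high - low) * num_bins)
    return min(idx, num_bins - 1)
```

Note that a Gaussian (tanh-squashed) policy over a binned action space concentrates probability mass unevenly across bins, which can interact badly with a continuous-action entropy target.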