
[Question] Any idea why SAC loss would diverge? #50

Closed
redknightlois opened this issue Apr 25, 2019 · 8 comments

@redknightlois
Contributor

I left it running for a few epochs, several times to ensure that it was not a fluke.

And SAC is collapsing to always choose the same action.

replay_buffer/size                       210000
trainer/QF1 Loss                              1.35779e+19
trainer/QF2 Loss                              1.34288e+19
trainer/Policy Loss                          -2.48799e+10
trainer/Q1 Predictions Mean                   2.33888e+10
trainer/Q1 Predictions Std                    3.70217e+09
trainer/Q1 Predictions Max                    3.68046e+10
trainer/Q1 Predictions Min                    1.31057e+10
trainer/Q2 Predictions Mean                   2.34333e+10
trainer/Q2 Predictions Std                    3.65296e+09
trainer/Q2 Predictions Max                    3.66932e+10
trainer/Q2 Predictions Min                    1.33272e+10
trainer/Q Targets Mean                        2.36857e+10
trainer/Q Targets Std                         4.52467e+09
trainer/Q Targets Max                         3.54759e+10
trainer/Q Targets Min                         0.224346
trainer/Log Pis Mean                          0.987727
trainer/Log Pis Std                           1.12239
trainer/Log Pis Max                           2.15324
trainer/Log Pis Min                          -4.0056
trainer/Policy mu Mean                        1.52476
trainer/Policy mu Std                         0.0895151
trainer/Policy mu Max                         1.62818
trainer/Policy mu Min                         1.37598
trainer/Policy log std Mean                  -0.582497
trainer/Policy log std Std                    0.0243203
trainer/Policy log std Max                   -0.492316
trainer/Policy log std Min                   -0.640244
trainer/Alpha                                 5.56742e+08
trainer/Alpha Loss                            0.247146
exploration/num steps total                   2.491e+06
exploration/num paths total               23586
exploration/path length Mean                131.579
exploration/path length Std                  57.1612
exploration/path length Max                 200
exploration/path length Min                   8
exploration/Rewards Mean                      0.264324
exploration/Rewards Std                       0.149922
exploration/Rewards Max                       0.590382
exploration/Rewards Min                       0.0141083
exploration/Returns Mean                     34.7795
exploration/Returns Std                      23.3818
exploration/Returns Max                      83.2558
exploration/Returns Min                       2.15501
exploration/Actions Mean                      0.4906
exploration/Actions Std                       0.0686414
exploration/Actions Max                       0.5
exploration/Actions Min                      -0.5
exploration/Num Paths                        38
exploration/Average Returns                  34.7795
exploration/env_infos/final/time Mean         0.342105
exploration/env_infos/final/time Std          0.285806
exploration/env_infos/final/time Max          0.96
exploration/env_infos/final/time Min          0
exploration/env_infos/initial/time Mean       0.995
exploration/env_infos/initial/time Std        3.33067e-16
exploration/env_infos/initial/time Max        0.995
exploration/env_infos/initial/time Min        0.995
exploration/env_infos/time Mean               0.606472
exploration/env_infos/time Std                0.263458
exploration/env_infos/time Max                0.995
exploration/env_infos/time Min                0
evaluation/num steps total                    2.45463e+06
evaluation/num paths total                21675
evaluation/path length Mean                 115.452
evaluation/path length Std                   52.2554
evaluation/path length Max                  200
evaluation/path length Min                    9
evaluation/Rewards Mean                       0.248655
evaluation/Rewards Std                        0.0242211
evaluation/Rewards Max                        0.294154
evaluation/Rewards Min                        0.193703
evaluation/Returns Mean                      28.7078
evaluation/Returns Std                       12.9204
evaluation/Returns Max                       52.5658
evaluation/Returns Min                        2.53809
evaluation/Actions Mean                       0.5
evaluation/Actions Std                        0
evaluation/Actions Max                        0.5
evaluation/Actions Min                        0.5
evaluation/Num Paths                         42
evaluation/Average Returns                   28.7078
evaluation/env_infos/final/time Mean          0.422738
evaluation/env_infos/final/time Std           0.261277
evaluation/env_infos/final/time Max           0.955
evaluation/env_infos/final/time Min           0
evaluation/env_infos/initial/time Mean        0.995
evaluation/env_infos/initial/time Std         2.22045e-16
evaluation/env_infos/initial/time Max         0.995
evaluation/env_infos/initial/time Min         0.995
evaluation/env_infos/time Mean                0.64974
evaluation/env_infos/time Std                 0.245087
evaluation/env_infos/time Max                 0.995
evaluation/env_infos/time Min                 0
time/data storing (s)                         0.0476881
time/evaluation sampling (s)                 13.4834
time/exploration sampling (s)                15.2477
time/logging (s)                              0.0254512
time/saving (s)                               0.0218989
time/training (s)                           111.327
time/epoch (s)                              140.153
time/total (s)                            68869.7
Epoch                                       497

Running it from master. Could it be related to the action space being effectively discrete? The environment discretizes the actions into 'x' states based on the input data.

@vitchyr
Collaborator

vitchyr commented Apr 25, 2019

A bit hard to figure out based on just this. What did you set as the target entropy? An alpha of 5.56742e+08 seems rather large.
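For context, the alpha in the log comes from SAC's automatic temperature tuning. A minimal sketch of that update, assuming the usual log-alpha parameterization (names here are illustrative, not necessarily rlkit's exact attributes): if the policy's entropy stays below the target entropy, the loss keeps pushing alpha up, which is one way alpha can reach values like 5.6e8.

```python
import torch

# Minimal sketch of SAC's automatic temperature (alpha) update, assuming a
# log_alpha parameter and a fixed target_entropy. Names are illustrative.
log_alpha = torch.zeros(1, requires_grad=True)
alpha_optimizer = torch.optim.Adam([log_alpha], lr=3e-4)
target_entropy = -1.0  # default heuristic: -dim(action_space)

def update_alpha(log_pi):
    # log_pi: log-probabilities of actions sampled from the current policy.
    # If the policy's entropy (-log_pi on average) stays below target_entropy,
    # this loss keeps pushing log_alpha up, so alpha can grow without bound
    # when the target entropy is set unreachably high.
    alpha_loss = -(log_alpha * (log_pi + target_entropy).detach()).mean()
    alpha_optimizer.zero_grad()
    alpha_loss.backward()
    alpha_optimizer.step()
    return log_alpha.exp().item()

# e.g. update_alpha(policy_log_probs) inside the training loop
```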

@redknightlois
Contributor Author

redknightlois commented Apr 25, 2019

You are right, that is not an underflow... that is divergence (it's a +).
I didn't set it explicitly, so it would be the default: self.target_entropy = -np.prod((1,)).item()
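For reference, a quick check of what that default evaluates to: it is the standard continuous-action heuristic of minus the action dimensionality.

```python
import numpy as np

# What the default above evaluates to: minus the product of the action-space
# shape, i.e. "target entropy = -dim(action_space)".
# For a 1-D action space that is -1, which is tuned for continuous actions.
action_space_shape = (1,)
target_entropy = -np.prod(action_space_shape).item()
print(target_entropy)  # -1
```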

@redknightlois redknightlois changed the title [Question] Any idea why loss would underflow on SAC? [Question] Any idea why loss would SAC diverge? Apr 25, 2019
@redknightlois redknightlois changed the title [Question] Any idea why loss would SAC diverge? [Question] Any idea why SAC loss would diverge? Apr 25, 2019
@vitchyr
Collaborator

vitchyr commented Apr 25, 2019

Are you using discrete actions? That heuristic wouldn't work in that case.

@redknightlois
Contributor Author

redknightlois commented Apr 25, 2019

It behaves like discrete actions, yes. I didn't change the policy to optimize a softmax because I couldn't figure out how to derive the temperature-based sampling/exploration from the equations. Everybody says it is easy, but no one shows how to do it :D (e.g. openai/spinningup#22)

So I hacked around it instead, making the environment interpret the continuous actions as discrete signals (a sketch of the idea is below). It was bound to have some side effects, and now that you point out it is actually diverging, it makes sense. For a typical 3-state case (softmax over 3 actions), what target entropy would you suggest I try?
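For illustration, a hypothetical sketch of the kind of binning hack described above; the bounds and bin count are made up to match a 3-action case and are not the actual environment code.

```python
import numpy as np

# Hypothetical sketch: the environment receives a continuous action in
# [-0.5, 0.5] and bins it into one of n_bins discrete choices.
def continuous_to_discrete(action, low=-0.5, high=0.5, n_bins=3):
    clipped = np.clip(action, low, high)
    # Map [low, high] -> {0, ..., n_bins - 1}
    idx = int((clipped - low) / (high - low) * n_bins)
    return min(idx, n_bins - 1)

assert continuous_to_discrete(-0.5) == 0
assert continuous_to_discrete(0.0) == 1
assert continuous_to_discrete(0.5) == 2
```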

@vitchyr
Collaborator

vitchyr commented Apr 25, 2019

For discrete actions, you should choose a positive number that's less than log(# of actions).

To compute the entropy, look up the definition of entropy. For discrete actions it's -sum(p * log(p)).
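For concreteness, a small sketch of that computation for a 3-action categorical policy. Note the maximum depends on the log base: log(3) ≈ 1.099 in nats (the convention most SAC implementations use), ≈ 0.477 in base 10.

```python
import numpy as np

# Entropy of a categorical distribution: H(p) = -sum_i p_i * log(p_i).
# With natural log, the maximum for 3 actions is log(3) ~= 1.099 nats.
def categorical_entropy(p):
    p = np.asarray(p, dtype=np.float64)
    p = p[p > 0]  # 0 * log(0) is taken as 0
    return float(-(p * np.log(p)).sum())

print(categorical_entropy([1/3, 1/3, 1/3]))    # ~1.0986, the maximum
print(categorical_entropy([0.9, 0.05, 0.05]))  # ~0.394, a near-collapsed policy
```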

@redknightlois
Contributor Author

This is my first time with an entropy-based algorithm, so I'm clueless on that... Any accessible write-up you know of would be great to read.

Would you go with something close to log(3) (~0.477) or closer to zero instead?

@redknightlois
Contributor Author

OK, now that I changed the target entropy to 0.35 I don't see the divergent behavior anymore, but what I do see is a collapse in the deterministic policy results. The strange thing is that if I restart the process loading the last policy, I get behaviors similar to those found in the exploration phase of the epoch. Sounds like a bug in the evaluation part, could that be?

@vitchyr
Collaborator

vitchyr commented Apr 27, 2019

The evaluation code and exploration code are the same. They just use the DataCollector. Note that if you're loading up the policy, you might be loading the evaluation policy, which is deterministic. It sounds like the SAC loss issue has been resolved, so I'm closing this issue.
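For illustration, a rough sketch (not rlkit's exact API) of why a loaded evaluation snapshot behaves differently from the exploration sampler: evaluation takes the tanh of the mean, while exploration samples from the tanh-Gaussian.

```python
import torch

# Illustrative sketch of deterministic evaluation vs. stochastic exploration
# for a tanh-Gaussian policy. The mean/log_std values echo the logged
# "Policy mu Mean" and "Policy log std Mean" above, purely as an example.
def get_action(mean, log_std, deterministic=False):
    if deterministic:
        return torch.tanh(mean)                   # what the evaluation snapshot does
    std = log_std.exp()
    z = mean + std * torch.randn_like(std)        # reparameterized sample
    return torch.tanh(z)                          # what the exploration sampler does

mean, log_std = torch.tensor([1.5]), torch.tensor([-0.58])
print(get_action(mean, log_std, deterministic=True))   # always the same
print(get_action(mean, log_std, deterministic=False))  # varies per call
```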
