
[Question] Any idea why SAC loss would diverge? #50

Closed
redknightlois opened this issue Apr 25, 2019 · 8 comments

@redknightlois
Contributor

I left it running for a few epochs, several times to ensure that it was not a fluke.

And SAC is collapsing to always choose the same action.

replay_buffer/size                       210000
trainer/QF1 Loss                              1.35779e+19
trainer/QF2 Loss                              1.34288e+19
trainer/Policy Loss                          -2.48799e+10
trainer/Q1 Predictions Mean                   2.33888e+10
trainer/Q1 Predictions Std                    3.70217e+09
trainer/Q1 Predictions Max                    3.68046e+10
trainer/Q1 Predictions Min                    1.31057e+10
trainer/Q2 Predictions Mean                   2.34333e+10
trainer/Q2 Predictions Std                    3.65296e+09
trainer/Q2 Predictions Max                    3.66932e+10
trainer/Q2 Predictions Min                    1.33272e+10
trainer/Q Targets Mean                        2.36857e+10
trainer/Q Targets Std                         4.52467e+09
trainer/Q Targets Max                         3.54759e+10
trainer/Q Targets Min                         0.224346
trainer/Log Pis Mean                          0.987727
trainer/Log Pis Std                           1.12239
trainer/Log Pis Max                           2.15324
trainer/Log Pis Min                          -4.0056
trainer/Policy mu Mean                        1.52476
trainer/Policy mu Std                         0.0895151
trainer/Policy mu Max                         1.62818
trainer/Policy mu Min                         1.37598
trainer/Policy log std Mean                  -0.582497
trainer/Policy log std Std                    0.0243203
trainer/Policy log std Max                   -0.492316
trainer/Policy log std Min                   -0.640244
trainer/Alpha                                 5.56742e+08
trainer/Alpha Loss                            0.247146
exploration/num steps total                   2.491e+06
exploration/num paths total               23586
exploration/path length Mean                131.579
exploration/path length Std                  57.1612
exploration/path length Max                 200
exploration/path length Min                   8
exploration/Rewards Mean                      0.264324
exploration/Rewards Std                       0.149922
exploration/Rewards Max                       0.590382
exploration/Rewards Min                       0.0141083
exploration/Returns Mean                     34.7795
exploration/Returns Std                      23.3818
exploration/Returns Max                      83.2558
exploration/Returns Min                       2.15501
exploration/Actions Mean                      0.4906
exploration/Actions Std                       0.0686414
exploration/Actions Max                       0.5
exploration/Actions Min                      -0.5
exploration/Num Paths                        38
exploration/Average Returns                  34.7795
exploration/env_infos/final/time Mean         0.342105
exploration/env_infos/final/time Std          0.285806
exploration/env_infos/final/time Max          0.96
exploration/env_infos/final/time Min          0
exploration/env_infos/initial/time Mean       0.995
exploration/env_infos/initial/time Std        3.33067e-16
exploration/env_infos/initial/time Max        0.995
exploration/env_infos/initial/time Min        0.995
exploration/env_infos/time Mean               0.606472
exploration/env_infos/time Std                0.263458
exploration/env_infos/time Max                0.995
exploration/env_infos/time Min                0
evaluation/num steps total                    2.45463e+06
evaluation/num paths total                21675
evaluation/path length Mean                 115.452
evaluation/path length Std                   52.2554
evaluation/path length Max                  200
evaluation/path length Min                    9
evaluation/Rewards Mean                       0.248655
evaluation/Rewards Std                        0.0242211
evaluation/Rewards Max                        0.294154
evaluation/Rewards Min                        0.193703
evaluation/Returns Mean                      28.7078
evaluation/Returns Std                       12.9204
evaluation/Returns Max                       52.5658
evaluation/Returns Min                        2.53809
evaluation/Actions Mean                       0.5
evaluation/Actions Std                        0
evaluation/Actions Max                        0.5
evaluation/Actions Min                        0.5
evaluation/Num Paths                         42
evaluation/Average Returns                   28.7078
evaluation/env_infos/final/time Mean          0.422738
evaluation/env_infos/final/time Std           0.261277
evaluation/env_infos/final/time Max           0.955
evaluation/env_infos/final/time Min           0
evaluation/env_infos/initial/time Mean        0.995
evaluation/env_infos/initial/time Std         2.22045e-16
evaluation/env_infos/initial/time Max         0.995
evaluation/env_infos/initial/time Min         0.995
evaluation/env_infos/time Mean                0.64974
evaluation/env_infos/time Std                 0.245087
evaluation/env_infos/time Max                 0.995
evaluation/env_infos/time Min                 0
time/data storing (s)                         0.0476881
time/evaluation sampling (s)                 13.4834
time/exploration sampling (s)                15.2477
time/logging (s)                              0.0254512
time/saving (s)                               0.0218989
time/training (s)                           111.327
time/epoch (s)                              140.153
time/total (s)                            68869.7
Epoch                                       497

Running it from master. Could it be related to the action space being effectively discrete? The environment discretizes the actions into 'x' states based on the input data.

@vitchyr
Collaborator

vitchyr commented Apr 25, 2019

A bit hard to figure out based on just this. What did you set as the target entropy? An alpha of 5.56742e+08 seems rather large.
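For context, the alpha in the log comes from SAC's automatic temperature tuning. A minimal sketch of that update, assuming the usual log-alpha parameterization (names here are illustrative, not necessarily rlkit's exact attributes): if the policy's entropy stays below the target entropy, the loss keeps pushing alpha up, which is one way alpha can reach values like 5.6e8.

```python
import torch

# Minimal sketch of SAC's automatic temperature (alpha) update, assuming a
# log_alpha parameter and a fixed target_entropy. Names are illustrative.
log_alpha = torch.zeros(1, requires_grad=True)
alpha_optimizer = torch.optim.Adam([log_alpha], lr=3e-4)
target_entropy = -1.0  # default heuristic: -dim(action_space)

def update_alpha(log_pi):
    # log_pi: log-probabilities of actions sampled from the current policy.
    # If the policy's entropy (-log_pi on average) stays below target_entropy,
    # this loss keeps pushing log_alpha up, so alpha can grow without bound
    # when the target entropy is set unreachably high.
    alpha_loss = -(log_alpha * (log_pi + target_entropy).detach()).mean()
    alpha_optimizer.zero_grad()
    alpha_loss.backward()
    alpha_optimizer.step()
    return log_alpha.exp().item()

# e.g. update_alpha(policy_log_probs) inside the training loop
```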

@redknightlois
Contributor Author

redknightlois commented Apr 25, 2019

You are right, that is not an underflow... that is divergence (it's a +).
I didn't set it explicitly, so it would be the default: self.target_entropy = -np.prod((1,)).item()
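For reference, a quick check of what that default evaluates to: it is the standard continuous-action heuristic of minus the action dimensionality.

```python
import numpy as np

# What the default above evaluates to: minus the product of the action-space
# shape, i.e. "target entropy = -dim(action_space)".
# For a 1-D action space that is -1, which is tuned for continuous actions.
action_space_shape = (1,)
target_entropy = -np.prod(action_space_shape).item()
print(target_entropy)  # -1
```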

@redknightlois redknightlois changed the title [Question] Any idea why loss would underflow on SAC? [Question] Any idea why loss would SAC diverge? Apr 25, 2019
@redknightlois redknightlois changed the title [Question] Any idea why loss would SAC diverge? [Question] Any idea why SAC loss would diverge? Apr 25, 2019
@vitchyr
Collaborator

vitchyr commented Apr 25, 2019

Are you using discrete actions? That heuristic wouldn't work in that case.

@redknightlois
Contributor Author

redknightlois commented Apr 25, 2019

It behaves like discrete actions, yes. I didn't change the policy to optimize a softmax because I couldn't figure out how to derive the temperature-based sampling/exploration from the equations. Everybody says it is easy, but no one shows how to do it :D (e.g. openai/spinningup#22)

So I hacked around it instead, making the environment interpret the continuous actions as discrete signals (a sketch of the idea is below). It was bound to have some side effects, and now that you point out it is actually diverging, it makes sense. For a typical 3-state case (softmax over 3 actions), what target entropy would you suggest I try?
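For illustration, a hypothetical sketch of the kind of binning hack described above; the bounds and bin count are made up to match a 3-action case and are not the actual environment code.

```python
import numpy as np

# Hypothetical sketch: the environment receives a continuous action in
# [-0.5, 0.5] and bins it into one of n_bins discrete choices.
def continuous_to_discrete(action, low=-0.5, high=0.5, n_bins=3):
    clipped = np.clip(action, low, high)
    # Map [low, high] -> {0, ..., n_bins - 1}
    idx = int((clipped - low) / (high - low) * n_bins)
    return min(idx, n_bins - 1)

assert continuous_to_discrete(-0.5) == 0
assert continuous_to_discrete(0.0) == 1
assert continuous_to_discrete(0.5) == 2
```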

@vitchyr
Collaborator

vitchyr commented Apr 25, 2019

For discrete actions, you should choose a positive number that's less than log(# of actions).

To compute the entropy, look up the definition of entropy. For discrete actions it's -sum(p * log(p)).
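For concreteness, a small sketch of that computation for a 3-action categorical policy. Note the maximum depends on the log base: log(3) ≈ 1.099 in nats (the convention most SAC implementations use), ≈ 0.477 in base 10.

```python
import numpy as np

# Entropy of a categorical distribution: H(p) = -sum_i p_i * log(p_i).
# With natural log, the maximum for 3 actions is log(3) ~= 1.099 nats.
def categorical_entropy(p):
    p = np.asarray(p, dtype=np.float64)
    p = p[p > 0]  # 0 * log(0) is taken as 0
    return float(-(p * np.log(p)).sum())

print(categorical_entropy([1/3, 1/3, 1/3]))    # ~1.0986, the maximum
print(categorical_entropy([0.9, 0.05, 0.05]))  # ~0.394, a near-collapsed policy
```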

@redknightlois
Contributor Author

This is my first time with an entropy-based algorithm, so I'm clueless on that... Any accessible write-up you know of would be great to read.

Would you go with something close to log(3) (~0.477) or closer to zero instead?

@redknightlois
Contributor Author

OK, now that I changed the target entropy to 0.35 I don't see the divergent behavior anymore, but what I do see is a collapse in the deterministic policy results. The strange thing is that if I restart the process loading the last policy, I get behaviors similar to those found in the exploration phase of the epoch. Sounds like a bug in the evaluation part, could that be?

@vitchyr
Collaborator

vitchyr commented Apr 27, 2019

The evaluation code and exploration code are the same. They just use the DataCollector. Note that if you're loading up the policy, you might be loading the evaluation policy, which is deterministic. It sounds like the SAC loss issue has been resolved, so I'm closing this issue.
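For illustration, a rough sketch (not rlkit's exact API) of why a loaded evaluation snapshot behaves differently from the exploration sampler: evaluation takes the tanh of the mean, while exploration samples from the tanh-Gaussian.

```python
import torch

# Illustrative sketch of deterministic evaluation vs. stochastic exploration
# for a tanh-Gaussian policy. The mean/log_std values echo the logged
# "Policy mu Mean" and "Policy log std Mean" above, purely as an example.
def get_action(mean, log_std, deterministic=False):
    if deterministic:
        return torch.tanh(mean)                   # what the evaluation snapshot does
    std = log_std.exp()
    z = mean + std * torch.randn_like(std)        # reparameterized sample
    return torch.tanh(z)                          # what the exploration sampler does

mean, log_std = torch.tensor([1.5]), torch.tensor([-0.58])
print(get_action(mean, log_std, deterministic=True))   # always the same
print(get_action(mean, log_std, deterministic=False))  # varies per call
```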
