NaNs in generated quantities since 2.27.0 #3057

alyst · 2021-08-22T22:01:38Z

Summary:

Since Stan v2.27.0, my model produces NaNs in the generated quantities block.
In v2.26.1 it works fine. The parameters and transformed parameters blocks look fine with both versions.

Description:

I use Stan via cmdstanr. When I run the model (in NUTS mode) compiled with Stan v2.27.0 (both with and without MKL), I get tons of messages like

Chain 4 Exception: double_exponential_lpdf: Random variable is -nan, but must be finite! (in '/tmp/RtmpRY7her/model-f78267ee0eed8.stan', line 670, column 10 to line 671, column 143)

This refers to the generated quantities section. It is related to the generated quantity on L630, which evaluates to NaN although the values on the RHS are just fine.

This model works fine in Stan v2.26.1 and versions before.

Reproducible Steps:

I've uploaded my model and the data.
Unfortunately, the model is rather big.
I've tried to come up with a minimal reproducible example, but very simple examples with a single vector variable work just fine.
However, the fact that I'm doing sparse matrix multiplication at L630 is not essential. Simple indexing expressions like

vector[Niactions] iaction_labu = obj_base_labu[iaction2obj]

also generate NaNs.

Current Output:

NaNs for generated quantities. I can provide the stan output file.

Expected Output:

The generated quantities are properly calculated and don't contain NaNs.

Additional Information:

Provide any additional information here.

Current Version:

v2.27.0

rok-cesnovar · 2021-08-23T11:59:52Z

Can you try running the model and just print everything:

vector[Niactions] iaction_labu = obj_base_labu[iaction2obj];
print(obj_base_labu);
print(iaction2obj);
print(iaction_labu);

alyst · 2021-08-23T13:45:18Z

> Can you try running the model and just print everything:

I'm also printing:

    print(iaction2obj);
    print(obj_base_labu);
    print(iaction_labu);
    print(iaction_labu_replCI);
    print(iact_repl_shift_sigma);

Here's the excerpt:

Chain 2 [1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1] 
Chain 2 [-17.9468] 
Chain 2 [-17.9468,-17.9468,-17.9468,-17.9468,-17.9468,-17.9468,-17.9468,-17.9468,-17.9468,-17.9468,-17.9468,-17.9468,-17.9468,-17.9468,-17.9468,-17.9468,-17.9468,-17.9468,-17.9468,-17.9468,-17.9468,-17.9468,-17.9468,-17.9468,-17.9468,-17.9468,-17.9468,-17.9468,-17.9468,-17.9468,-17.9468,-17.9468,-17.9468,-17.9468,-17.9468,-17.9468,-17.9468,-17.9468,-17.9468,-17.9468,-17.9468,-17.9468,-17.9468,-17.9468,-17.9468,-17.9468,-17.9468,-17.9468,-17.9468,-17.9468,-17.9468,-17.9468,-17.9468,-17.9468,-17.9468,-17.9468,-17.9468,-17.9468] 
Chain 2 [-17.855,-18.0153,-18.0043,-17.8235,-17.9455,-17.9057,-17.9792,-17.9,-17.9515,-17.7608,-17.8635,-17.9856,-17.95,-17.8886,-17.9244,-18.0805,-18.0701,-17.9896,-17.8534,-17.916,-17.9039,-17.8888,-17.9868,-17.9199,-17.875,-17.9773,-17.9365,-18.0336,-17.9753,-17.9545,-17.9391,-18.0071,-17.9655,-17.9173,-17.8501,-17.9067,-18.0192,-18.4609,-17.8652,-18.0197,-18.0445,-17.9338,-17.4626,-18.1821,-17.9559,-17.9458,-17.9455,-17.0407,-17.9453,-17.9677,-17.9464,-17.94,-17.9981,-17.9443,-17.9918,-17.9594,-17.9788,-17.9085] 
Chain 2 [0.0574907,0.046585,0.0564328,0.109842,0.00449232,0.0916095,0.0417634,0.0555958,0.107741,0.381183,0.257974,0.0659277,0.0165423,0.107394,0.178859,0.0929335,0.049117,0.0808125,0.159419,0.0575193,0.123784,0.0632653,0.02198,0.0327219,0.0912894,0.0359584,0.0608553,0.0742691,0.064116,0.109115,0.0111847,0.0613069,0.0335359,0.0406772,0.203661,0.0228246,0.10307,0.45007,0.0498263,0.0881806,0.140508,0.0531773,0.257451,0.52,0.0203236,0.00638217,0.0557117,0.683095,0.0193757,0.0363383,0.00817593,0.0594832,0.0527237,0.0112367,0.0696364,0.080595,0.0283944,0.0557641] 
Chain 2 Exception: double_exponential_lpdf: Random variable is -nan, but must be finite! (in '/tmp/RtmpfAZQAg/model-fe3016dad7ca8.stan', line 676, column 10 to line 677, column 143)

And this looks fine. However, in the chain output there are mostly NaNs for iaction_labu and the other GQs and only sporadically some finite numbers. For example, the sample -17.9468 (which you see in this debug output) is not there.

alyst · 2021-09-20T17:00:09Z

Gentle ping.
I've attached stanoutput.zip, which contains stanfit objects for v2.26.1 and v2.27.0 for the provided model and input data.
It turns out that not only do the generated quantities have NaNs, but the model also fails to converge in v2.27.0 (Rhat > 1.1, plus max treedepth problems), while everything is fine for v2.26.1.
So for me this issue looks pretty serious, because the convergence and the inference results of the other models might be affected as well.

wds15 · 2021-09-20T20:29:44Z

I can confirm that things are really odd here, and I do agree that this is worrying! When running this model with 2.26.1, I get no exceptions thrown during the sampling phase and all Rhats are 1.0. When running the model with 2.27.0, I get this message a huge number of times (during sampling only):

Chain 1 Exception: double_exponential_lpdf: Random variable is nan, but must be finite! (in '/var/folders/pn/96gtnbqd10j97x7llnjl5k6r0000gn/T/RtmpPpJpzy/model-16f5b789e53de.stan', line 670, column 10 to line 671, column 143)

... and even worse, the Rhats are all crap (much bigger than the 1.0 they are with 2.26.1).

From skimming over the model, it uses a lot of Eigen expressions. @SteveBronder do you recall if anything changed in this regard from 2.26 to 2.27, and whether expressions could be the root cause? Also pinging @t4c1, just in case.

I'd think we should fix this issue before releasing 2.28.

@alyst Can you think of a way to very quickly tickle the buggy behaviour with 2.27 (while not tickling it with 2.26)? E.g. use a good initial value and few samples. This way we should be able to quickly bisect the commits between 2.26 and 2.27... I have never done that... @rok-cesnovar do we have some Jenkins facility for that, or how would that work?

@alyst You could try to create a debug version of the model by these steps (for example):

- Strip out generated quantities and check whether the Rhats are still off, as a way to see if things still go wrong.
- Try to replace all the vector expressions by arrays, mostly using element-wise loops. That's slow, but would make the point that Eigen expressions are to blame.

wds15 · 2021-09-20T20:32:14Z

Actually... we do have a BUG here which needs fixing. The gradients are crap with 2.27. The diagnose utility with 2.27 gives me

       552        -1.17838        0.135755         1.17838        -1.04263
       553       -0.205238               0        0.205128       -0.205128
       554       -0.927986      -0.0726003        0.927909        -1.00051
       555       -0.928954        0.135755        0.928922       -0.793167
       556       -0.710726               0        0.710726       -0.710726
       557        0.476948      -0.0726003       -0.476948        0.404348
       558        -1.80476        0.135755         1.80476        -1.66901
       559        -1.99413       0.0997067         1.99413        -1.89443
       560          0.7089       -0.181671         -0.7089        0.527229
       561       -0.444672        0.771751        0.444673        0.327078
       562        0.561057               0       -0.561055        0.561055
       563        0.681151       -0.181671       -0.681151         0.49948
       564        0.727104        0.771751       -0.727106         1.49886
       565       -0.745361               0        0.745363       -0.745363
       566       -0.663369       -0.181671        0.663369        -0.84504
       567        -1.99743        0.771751         1.99743        -1.22568
       568         1.90213        -43.8923        -24.2459        -19.6464

while with 2.26 they are ok:

       552         1.68477        -1.68477        -1.68477      2.8595e-07
       553        -1.65574         1.65573         1.65573     1.57776e-06
       554        -1.54169         1.54168         1.54168     2.79641e-08
       555        0.628847       -0.628849       -0.628849     5.54725e-07
       556       -0.542673        0.542685        0.542685     7.52982e-07
       557        -1.97117          1.9712          1.9712    -1.09882e-06
       558          1.8513        -1.85133        -1.85133    -5.78294e-08
       559         1.81816        -1.81816        -1.81816    -3.62035e-07
       560        0.308039       -0.308039       -0.308038     -1.0884e-06
       561        0.833769       -0.833769       -0.833769    -2.60753e-08
       562         1.29994        -1.29994        -1.29994    -3.90667e-07
       563        -1.30433         1.30433         1.30433    -6.55807e-07
       564         1.12525        -1.12525        -1.12525     -5.1266e-07
       565       -0.945488        0.945489        0.945489     3.02066e-08
       566         -1.2199          1.2199          1.2199     -1.7155e-07
       567       -0.483936        0.483936        0.483936     4.18004e-07
       568         1.82344        -16.6813        -16.6813    -2.69522e-07

This really needs fixing.

wds15 · 2021-09-20T20:37:43Z

I just wanted to build this under develop, but then I get this:

--- Translating Stan model to C++ code ---
bin/stanc  --o=/Users/weberse2/work/bugs/cmdstan-issue/msglm_local_subobjects-develop.hpp /Users/weberse2/work/bugs/cmdstan-issue/msglm_local_subobjects-develop.stan
bin/stanc: line 1: syntax error near unexpected token `newline'
bin/stanc: line 1: `<?xml version="1.0" encoding="UTF-8"?>'
make: *** [/Users/weberse2/work/bugs/cmdstan-issue/msglm_local_subobjects-develop.hpp] Error 2

@WardBrian maybe... ideas?

wds15 · 2021-09-20T20:39:39Z

@alyst No more model simplification is needed... the cmdstan diagnose utility is enough. That one tests the autodiff gradients against finite differences and this clearly shows that 2.27 has bugs, which 2.26 does not have. Yack.

Thanks for reporting and thanks for being persistent!!!
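
For readers unfamiliar with this kind of check, here is a minimal sketch of a gradient test in the spirit of what is described above: compare the gradient a model reports against a finite-difference approximation of the same log density. This is not CmdStan's implementation. The function names, the toy standard-normal density, and the deliberately broken gradient are all assumptions made for illustration. Judging from the rows in the tables above, the last four columns look like value, model gradient, finite-difference gradient, and error (model minus finite difference), and the sketch returns rows in that layout.

```python
def finite_diff_gradient(log_density, x, epsilon=1e-6):
    """Central finite-difference approximation of the gradient at x."""
    grad = []
    for i in range(len(x)):
        up = list(x)
        down = list(x)
        up[i] += epsilon
        down[i] -= epsilon
        grad.append((log_density(up) - log_density(down)) / (2.0 * epsilon))
    return grad


def gradient_check(log_density, model_gradient, x, epsilon=1e-6, tolerance=1e-6):
    """Return (rows, passed); each row is (index, value, model, finite_diff, error)."""
    model = model_gradient(x)
    finite_diff = finite_diff_gradient(log_density, x, epsilon)
    rows = [
        (i, x[i], model[i], finite_diff[i], model[i] - finite_diff[i])
        for i in range(len(x))
    ]
    passed = all(abs(row[4]) <= tolerance for row in rows)
    return rows, passed


# Toy target (an assumption, not the model from this issue): independent
# standard normals, log p(x) = -0.5 * sum(x_i^2), so the gradient is -x.
def log_density(x):
    return -0.5 * sum(v * v for v in x)


def correct_gradient(x):
    return [-v for v in x]


# Hypothetical buggy gradient that reads the neighbouring element,
# standing in for an indexing mistake in hand-written derivative code.
def shifted_gradient(x):
    return [-x[(i + 1) % len(x)] for i in range(len(x))]


point = [1.68477, -1.65574, -1.54169]
_, ok_correct = gradient_check(log_density, correct_gradient, point)
_, ok_shifted = gradient_check(log_density, shifted_gradient, point)
print(ok_correct, ok_shifted)
```

A correct gradient leaves only rounding-level entries in the error column, as in the 2.26 table; an indexing mistake leaves errors of the same order as the gradient itself, as in the 2.27 table.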

WardBrian · 2021-09-20T20:41:50Z

> I just wanted to build this under develop, but then I get this:
>
> --- Translating Stan model to C++ code ---
> bin/stanc  --o=/Users/weberse2/work/bugs/cmdstan-issue/msglm_local_subobjects-develop.hpp /Users/weberse2/work/bugs/cmdstan-issue/msglm_local_subobjects-develop.stan
> bin/stanc: line 1: syntax error near unexpected token `newline'
> bin/stanc: line 1: `<?xml version="1.0" encoding="UTF-8"?>'
> make: *** [/Users/weberse2/work/bugs/cmdstan-issue/msglm_local_subobjects-develop.hpp] Error 2
>
> @WardBrian maybe... ideas?

That error looks like what I’d expect if I fed in something that wasn’t a stan model into the compiler, like an xml file

wds15 · 2021-09-20T20:49:09Z

bin/stanc  --o=/Users/weberse2/work/bugs/cmdstan-issue/msglm_local_subobjects-develop.hpp /Users/weberse2/work/bugs/cmdstan-issue/msglm_local_subobjects-develop.stan

is what triggers the above. The stan file is posted in the issue description above and the same call just works with 2.27 and 2.26... could you try to compile yourself?

WardBrian · 2021-09-20T20:53:44Z

Yeah, unable to recreate with a local build of stanc3/master

WardBrian · 2021-09-20T21:14:48Z

Here is the header file that produces: https://gist.github.com/WardBrian/db929d3edbc2bc7490a168f008e76b1a

wds15 · 2021-09-20T21:31:15Z

And why does develop fail to produce the hpp? Is this an issue with stanc3?

WardBrian · 2021-09-20T21:39:06Z

Stanc3 doesn't use the develop name; master is the latest available code. I am able to build the submitted Stan file with the most recent stanc3 possible. I don't believe there is any issue in stanc3 at play here. Are you sure you didn't clobber the file while copying/renaming?

rok-cesnovar · 2021-09-21T07:10:42Z

> I'd think we should fix this issue before releasing 2.28.

Agreed, we will postpone until we figure this out.

Hopefully now that we know we are able to use diagnose, this will be quick.

rok-cesnovar · 2021-09-21T07:22:36Z

I can confirm that I have no issue compiling with develop CmdStan that uses the nightly version of stanc3 (current master) and can see the issues with diagnose.

rok-cesnovar · 2021-09-21T09:37:52Z

The minimal example I can make is this:

transformed data {
  int<lower=0> N = 5;
  vector[N] obsXobs_shift0_w = [-0.67082,0.5,-0.223607,-0.223607,-0.5]';
  int obsXobs_shift0_v[N]  = {1,2,3,1,2};
  int obsXobs_shift0_u[N+1] = {1,2,3,4,5,6};
}

parameters {
  vector[N] obs_shift0;
}

model {
    vector[N] obs_repl_shift_unscaled;
    obs_repl_shift_unscaled = csr_matrix_times_vector(N, N, obsXobs_shift0_w, obsXobs_shift0_v, obsXobs_shift0_u, obs_shift0);
    obs_repl_shift_unscaled ~ std_normal();
}

you can also replace the model part with:

model {
    target += sum(csr_matrix_times_vector(N, N, obsXobs_shift0_w, obsXobs_shift0_v, obsXobs_shift0_u, obs_shift0));
}

I am fairly certain the issue is in csr_matrix_times_vector and reverting stan-dev/math#2462 seems to fix the issue (cmdstan on develop, math on the branch revert/2642).

@SteveBronder I don't have enough knowledge of the csr_* parts of the code. It would be great if you could replicate my findings. I would prefer finding a fix over a simple revert, though.
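
As a reference point for the second variant of the minimal example, the gradient it should produce can be worked out by hand: with target = sum(A * b), the derivative with respect to b[j] is the sum of column j of A, independent of b. The sketch below computes those column sums directly from the CSR triple (w, v, u) in the example, reading v and u as 1-based, which is how Stan indexes. The helper name is hypothetical and this is not Stan Math code.

```python
def csr_column_sums(n, w, v, u):
    """Column sums of an m x n CSR matrix given with 1-based v and u.

    w: non-zero values, v: column index of each value,
    u: position in w where each row starts (length m + 1).
    """
    sums = [0.0] * n
    m = len(u) - 1
    for row in range(m):
        # Entries of this row are w[u[row] .. u[row + 1] - 1] in 1-based terms.
        for k in range(u[row] - 1, u[row + 1] - 1):
            sums[v[k] - 1] += w[k]
    return sums


# Data copied from the transformed data block of the minimal example.
N = 5
obsXobs_shift0_w = [-0.67082, 0.5, -0.223607, -0.223607, -0.5]
obsXobs_shift0_v = [1, 2, 3, 1, 2]
obsXobs_shift0_u = [1, 2, 3, 4, 5, 6]

# Expected gradient of sum(csr_matrix_times_vector(...)) w.r.t. obs_shift0.
expected_gradient = csr_column_sums(
    N, obsXobs_shift0_w, obsXobs_shift0_v, obsXobs_shift0_u
)
print(expected_gradient)
```

Columns 4 and 5 hold no non-zero entries, so the corresponding gradient components should be exactly zero; a gradient check on this model that shows anything else points at the sparse product.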

wds15 · 2021-09-21T13:30:51Z

I am still having trouble with stanc3 master as of now. Maybe this is a Mac thing? I will wait for the RC 2.28 and try again.

Really cool that @rok-cesnovar found the culprit of this!

wds15 · 2021-09-21T13:41:55Z

ARGH...

[15:40:34][weberse2@C02XK2AGJHD2:~/work/cmdstan]$ make examples/bernoulli/bernoulli

--- Translating Stan model to C++ code ---
bin/stanc  --o=examples/bernoulli/bernoulli.hpp examples/bernoulli/bernoulli.stan
bin/stanc: line 1: syntax error near unexpected token `newline'
bin/stanc: line 1: `<?xml version="1.0" encoding="UTF-8"?>'
make: *** [examples/bernoulli/bernoulli.hpp] Error 2

... so not even the Bernoulli example works for me! Let's wait for the RC...

WardBrian · 2021-09-21T13:44:12Z

@wds15 can you try downloading the latest nightly for your system (https://github.com/stan-dev/stanc3/releases/tag/nightly) and running it directly (outside cmdstan)?

wds15 · 2021-09-21T15:22:37Z

Now it runs ok when I download as you suggest... and it shows that develop is affected by the bug as well.

SteveBronder · 2021-09-21T16:18:51Z

@rok-cesnovar Yes, I'm seeing a bug in the test for csr_matrix_times_vector. I think I can put up a patch today. Should we do a backwards patch for 2.27 fixing this in math?

rok-cesnovar · 2021-09-21T16:21:29Z

Thanks!

> Should we do a backwards patch for 2.27 fixing this in math?

I think fixing it for 2.28 is fine. At least that is how we handled things in the past with similar problems. We will prominently list this bugfix in the release notes though.

SteveBronder · 2021-09-21T21:55:09Z

Just posted the patch above, which should fix this. @alyst sorry for the trouble! The issue was that in the 2.27 version I was treating the non-zero indices passed by the user as starting from zero and not 1 (which is rather egregious, I'm surprised we did not catch that). The patch above adds some tests to make sure the indexing is all correct.
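
To make the off-by-one concrete, here is a small sketch of a CSR matrix-vector product in which the index base is an explicit parameter. It is not the Stan Math code and does not claim to reproduce the exact failure in 2.27; the function and its `index_base` argument are hypothetical. It only shows what happens when 1-based input, such as the data from the minimal example in this thread, is read as if it were 0-based.

```python
def csr_times_vector(m, w, v, u, b, index_base=1):
    """Multiply an m-row CSR matrix (w, v, u) by the vector b.

    index_base says how the entries of v and u are numbered:
    1 for Stan-style input, 0 for C-style input.
    """
    result = [0.0] * m
    for row in range(m):
        start = u[row] - index_base
        stop = u[row + 1] - index_base
        for k in range(start, stop):
            col = v[k] - index_base
            result[row] += w[k] * b[col]
    return result


# Data from the minimal example; b is an arbitrary test vector.
w = [-0.67082, 0.5, -0.223607, -0.223607, -0.5]
v = [1, 2, 3, 1, 2]
u = [1, 2, 3, 4, 5, 6]
b = [1.0, 2.0, 3.0, 4.0, 5.0]

# Read as 1-based (correct for Stan input).
print(csr_times_vector(5, w, v, u, b, index_base=1))

# Read as 0-based: every lookup is shifted by one, so rows pick up the
# wrong value and column, and the last row runs past the end of w and v.
try:
    print(csr_times_vector(5, w, v, u, b, index_base=0))
except IndexError as err:
    print("out of range:", err)
```

Python raises on the out-of-range access. C++ code indexing raw storage has no such guard, so a shifted read can silently return wrong or garbage values instead.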

alyst mentioned this issue Aug 23, 2021: Always write doubles with decimal point for cmdstan input stan-dev/cmdstanr#539 (Closed)

rok-cesnovar mentioned this issue Sep 21, 2021: Release 2.28 checklist stan-dev/cmdstan#1037 (Closed)

SteveBronder mentioned this issue Sep 21, 2021: [Fix] csr_matrix_times_vector indexing stan-dev/math#2586 (Merged)

rok-cesnovar closed this as completed Sep 23, 2021
