NaNs in generated quantities since 2.27.0 #3057

Closed
alyst opened this issue Aug 22, 2021 · 23 comments
Comments

@alyst
Contributor

alyst commented Aug 22, 2021

Summary:

Since Stan v2.27.0, my model generates NaNs in the generated quantities block.
With v2.26.1 it works fine, and the parameters and transformed parameters blocks look fine with both versions.

Description:

I use Stan via cmdstanr. When I run the model (in NUTS mode) compiled with Stan v2.27.0 (both with and without MKL), I get tons of messages like

Chain 4 Exception: double_exponential_lpdf: Random variable is -nan, but must be finite! (in '/tmp/RtmpRY7her/model-f78267ee0eed8.stan', line 670, column 10 to line 671, column 143)

This refers to the generated quantities block and is related to the generated quantity on L630, which evaluates to NaN even though the values on its right-hand side are fine.

This model works fine in Stan v2.26.1 and versions before.

Reproducible Steps:

I've uploaded my model and the data.
Unfortunately, the model is rather big.
I've tried to come up with a minimal reproducible example, but very simple examples with a single vector variable work just fine.
However, the fact that I'm doing sparse matrix multiplication at L630 is not essential. Simple indexing expressions like

vector[Niactions] iaction_labu = obj_base_labu[iaction2obj];

also generate NaNs.

Current Output:

NaNs for generated quantities. I can provide the stan output file.

Expected Output:

The generated quantities are properly calculated and don't contain NaNs.


Current Version:

v2.27.0

@rok-cesnovar
Member

Can you try running the model and just print everything:

vector[Niactions] iaction_labu = obj_base_labu[iaction2obj];
print(obj_base_labu);
print(iaction2obj);
print(iaction_labu);

@alyst
Contributor Author

alyst commented Aug 23, 2021

Can you try running the model and just print everything:

I'm also printing:

    print(iaction2obj);
    print(obj_base_labu);
    print(iaction_labu);
    print(iaction_labu_replCI);
    print(iact_repl_shift_sigma);

Here's the excerpt:

Chain 2 [1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1] 
Chain 2 [-17.9468] 
Chain 2 [-17.9468,-17.9468,-17.9468,-17.9468,-17.9468,-17.9468,-17.9468,-17.9468,-17.9468,-17.9468,-17.9468,-17.9468,-17.9468,-17.9468,-17.9468,-17.9468,-17.9468,-17.9468,-17.9468,-17.9468,-17.9468,-17.9468,-17.9468,-17.9468,-17.9468,-17.9468,-17.9468,-17.9468,-17.9468,-17.9468,-17.9468,-17.9468,-17.9468,-17.9468,-17.9468,-17.9468,-17.9468,-17.9468,-17.9468,-17.9468,-17.9468,-17.9468,-17.9468,-17.9468,-17.9468,-17.9468,-17.9468,-17.9468,-17.9468,-17.9468,-17.9468,-17.9468,-17.9468,-17.9468,-17.9468,-17.9468,-17.9468,-17.9468] 
Chain 2 [-17.855,-18.0153,-18.0043,-17.8235,-17.9455,-17.9057,-17.9792,-17.9,-17.9515,-17.7608,-17.8635,-17.9856,-17.95,-17.8886,-17.9244,-18.0805,-18.0701,-17.9896,-17.8534,-17.916,-17.9039,-17.8888,-17.9868,-17.9199,-17.875,-17.9773,-17.9365,-18.0336,-17.9753,-17.9545,-17.9391,-18.0071,-17.9655,-17.9173,-17.8501,-17.9067,-18.0192,-18.4609,-17.8652,-18.0197,-18.0445,-17.9338,-17.4626,-18.1821,-17.9559,-17.9458,-17.9455,-17.0407,-17.9453,-17.9677,-17.9464,-17.94,-17.9981,-17.9443,-17.9918,-17.9594,-17.9788,-17.9085] 
Chain 2 [0.0574907,0.046585,0.0564328,0.109842,0.00449232,0.0916095,0.0417634,0.0555958,0.107741,0.381183,0.257974,0.0659277,0.0165423,0.107394,0.178859,0.0929335,0.049117,0.0808125,0.159419,0.0575193,0.123784,0.0632653,0.02198,0.0327219,0.0912894,0.0359584,0.0608553,0.0742691,0.064116,0.109115,0.0111847,0.0613069,0.0335359,0.0406772,0.203661,0.0228246,0.10307,0.45007,0.0498263,0.0881806,0.140508,0.0531773,0.257451,0.52,0.0203236,0.00638217,0.0557117,0.683095,0.0193757,0.0363383,0.00817593,0.0594832,0.0527237,0.0112367,0.0696364,0.080595,0.0283944,0.0557641] 
Chain 2 Exception: double_exponential_lpdf: Random variable is -nan, but must be finite! (in '/tmp/RtmpfAZQAg/model-fe3016dad7ca8.stan', line 676, column 10 to line 677, column 143) 

And this looks fine. However, in the chain output there are mostly NaNs for iaction_labu and the other GQs and only sporadically some finite numbers. For example, the sample -17.9468 (which you see in this debug output) is not there.

@alyst
Contributor Author

alyst commented Sep 20, 2021

Gentle ping.
I've attached stanoutput.zip, which contains stanfit objects for v2.26.1 and v2.27.0 for the provided model and input data.
It turns out that not only do the generated quantities have NaNs, but the model also fails to converge in v2.27.0 (Rhat > 1.1, plus max treedepth problems), while everything is fine in v2.26.1.
So to me this issue looks pretty serious, because the convergence and the inference results of other models might be affected as well.

@wds15
Contributor

wds15 commented Sep 20, 2021

I can confirm that things are really odd here, and I agree that this is worrying! When running this model with 2.26.1, I get no exceptions during the sampling phase and all Rhats are 1.0. When running the model with 2.27.0, I get this message a huge number of times (during sampling only):

Chain 1 Exception: double_exponential_lpdf: Random variable is nan, but must be finite! (in '/var/folders/pn/96gtnbqd10j97x7llnjl5k6r0000gn/T/RtmpPpJpzy/model-16f5b789e53de.stan', line 670, column 10 to line 671, column 143) 

... and even worse, the Rhats are all crap (much bigger than the 1.0 they are with 2.26.1).

From skimming over the model, it uses a lot of Eigen expressions. @SteveBronder do you recall if anything changed in this regard from 2.26 to 2.27, and could expressions be the root cause? Also pinging @t4c1, just in case.

I'd think we should fix this issue before releasing 2.28.

@alyst Can you think of a way to trigger the buggy behaviour quickly with 2.27 (while not triggering it with 2.26)? E.g. use a good initial value and few samples. This way we should be able to quickly bisect the commits between 2.26 and 2.27. I have never done that... @rok-cesnovar do we have some Jenkins facility for that, or how would that work?

@alyst You could try to create a debug version of the model by these steps (for example):

  • strip out generated quantities and check whether the Rhats are still off, to see if things still go wrong.
  • try to replace all the vector expressions with arrays, mostly using element-wise loops. That's slow, but would show whether Eigen expressions are to blame.

@wds15
Contributor

wds15 commented Sep 20, 2021

Actually... we do have a BUG here which needs fixing. The gradients are crap with 2.27. The diagnose utility with 2.27 gives me (columns: param idx, value, model gradient, finite-difference gradient, error)

       552        -1.17838        0.135755         1.17838        -1.04263
       553       -0.205238               0        0.205128       -0.205128
       554       -0.927986      -0.0726003        0.927909        -1.00051
       555       -0.928954        0.135755        0.928922       -0.793167
       556       -0.710726               0        0.710726       -0.710726
       557        0.476948      -0.0726003       -0.476948        0.404348
       558        -1.80476        0.135755         1.80476        -1.66901
       559        -1.99413       0.0997067         1.99413        -1.89443
       560          0.7089       -0.181671         -0.7089        0.527229
       561       -0.444672        0.771751        0.444673        0.327078
       562        0.561057               0       -0.561055        0.561055
       563        0.681151       -0.181671       -0.681151         0.49948
       564        0.727104        0.771751       -0.727106         1.49886
       565       -0.745361               0        0.745363       -0.745363
       566       -0.663369       -0.181671        0.663369        -0.84504
       567        -1.99743        0.771751         1.99743        -1.22568
       568         1.90213        -43.8923        -24.2459        -19.6464

while with 2.26 they are ok:

       552         1.68477        -1.68477        -1.68477      2.8595e-07
       553        -1.65574         1.65573         1.65573     1.57776e-06
       554        -1.54169         1.54168         1.54168     2.79641e-08
       555        0.628847       -0.628849       -0.628849     5.54725e-07
       556       -0.542673        0.542685        0.542685     7.52982e-07
       557        -1.97117          1.9712          1.9712    -1.09882e-06
       558          1.8513        -1.85133        -1.85133    -5.78294e-08
       559         1.81816        -1.81816        -1.81816    -3.62035e-07
       560        0.308039       -0.308039       -0.308038     -1.0884e-06
       561        0.833769       -0.833769       -0.833769    -2.60753e-08
       562         1.29994        -1.29994        -1.29994    -3.90667e-07
       563        -1.30433         1.30433         1.30433    -6.55807e-07
       564         1.12525        -1.12525        -1.12525     -5.1266e-07
       565       -0.945488        0.945489        0.945489     3.02066e-08
       566         -1.2199          1.2199          1.2199     -1.7155e-07
       567       -0.483936        0.483936        0.483936     4.18004e-07
       568         1.82344        -16.6813        -16.6813    -2.69522e-07

This really needs fixing.

@wds15
Contributor

wds15 commented Sep 20, 2021

I just wanted to build this under develop, but then I get this:

--- Translating Stan model to C++ code ---
bin/stanc  --o=/Users/weberse2/work/bugs/cmdstan-issue/msglm_local_subobjects-develop.hpp /Users/weberse2/work/bugs/cmdstan-issue/msglm_local_subobjects-develop.stan
bin/stanc: line 1: syntax error near unexpected token `newline'
bin/stanc: line 1: `<?xml version="1.0" encoding="UTF-8"?>'
make: *** [/Users/weberse2/work/bugs/cmdstan-issue/msglm_local_subobjects-develop.hpp] Error 2

@WardBrian maybe... ideas?

@wds15
Contributor

wds15 commented Sep 20, 2021

@alyst No more model simplification is needed... the cmdstan diagnose utility is enough. That one tests the autodiff gradients against finite differences and this clearly shows that 2.27 has bugs, which 2.26 does not have. Yack.
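
As a rough illustration of the kind of check described above, here is a minimal Python sketch that compares a model gradient against central finite differences. It is not CmdStan's implementation: the target density, the function names and the deliberately wrong `buggy_grad` are all hypothetical and exist only to show what a passing and a failing check look like.

```python
def log_density(theta):
    # Hypothetical target: independent standard normals, up to a constant.
    return -0.5 * sum(t * t for t in theta)


def analytic_grad(theta):
    # Exact gradient of log_density; stands in for the autodiff result.
    return [-t for t in theta]


def buggy_grad(theta):
    # Deliberately wrong gradient (entries shifted by one position),
    # only to show what a failing check looks like.
    return [-t for t in theta[1:]] + [0.0]


def finite_diff_grad(f, theta, eps=1e-6):
    # Central differences: (f(x + eps) - f(x - eps)) / (2 * eps) per coordinate.
    grad = []
    for i in range(len(theta)):
        up = list(theta)
        down = list(theta)
        up[i] += eps
        down[i] -= eps
        grad.append((f(up) - f(down)) / (2.0 * eps))
    return grad


def max_abs_error(grad, theta):
    # Largest absolute difference between a gradient and finite differences.
    fd = finite_diff_grad(log_density, theta)
    return max(abs(g - d) for g, d in zip(grad, fd))


theta = [1.68477, -1.65574, 0.628847]
good_error = max_abs_error(analytic_grad(theta), theta)
bad_error = max_abs_error(buggy_grad(theta), theta)
print(good_error, bad_error)
```

A correct gradient gives an error near floating-point noise, as in the 2.26 table, while a wrong one gives errors of the same order as the gradient itself, as in the 2.27 table.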

Thanks for reporting and thanks for being persistent!!!

@WardBrian
Member

I just wanted to build this under develop, but then I get this:

--- Translating Stan model to C++ code ---
bin/stanc  --o=/Users/weberse2/work/bugs/cmdstan-issue/msglm_local_subobjects-develop.hpp /Users/weberse2/work/bugs/cmdstan-issue/msglm_local_subobjects-develop.stan
bin/stanc: line 1: syntax error near unexpected token `newline'
bin/stanc: line 1: `<?xml version="1.0" encoding="UTF-8"?>'
make: *** [/Users/weberse2/work/bugs/cmdstan-issue/msglm_local_subobjects-develop.hpp] Error 2

@WardBrian maybe... ideas?

That error looks like what I’d expect if I fed in something that wasn’t a stan model into the compiler, like an xml file

@wds15
Contributor

wds15 commented Sep 20, 2021

bin/stanc  --o=/Users/weberse2/work/bugs/cmdstan-issue/msglm_local_subobjects-develop.hpp /Users/weberse2/work/bugs/cmdstan-issue/msglm_local_subobjects-develop.stan

is what triggers the above. The stan file is posted in the issue description above and the same call just works with 2.27 and 2.26... could you try to compile yourself?

@WardBrian
Member

Yeah, unable to recreate with a local build of stanc3/master

@WardBrian
Member

Here is the header file it produces: https://gist.github.com/WardBrian/db929d3edbc2bc7490a168f008e76b1a

@wds15
Contributor

wds15 commented Sep 20, 2021

And why does develop fail to produce the hpp? Is this an issue with stanc3?

@WardBrian
Member

Stanc3 doesn't use the develop name; master is the latest available code. I am able to build the submitted Stan file with the most recent stanc3 possible. I don't believe there is any issue in stanc3 at play here. Are you sure you didn't clobber the file while copying/renaming?

@rok-cesnovar
Member

I'd think we should fix this issue before releasing 2.28.

Agreed, we will postpone until we figure this out.

Hopefully now that we know we are able to use diagnose, this will be quick.

@rok-cesnovar
Member

I can confirm that I have no issue compiling with develop CmdStan that uses the nightly version of stanc3 (current master) and can see the issues with diagnose.

@rok-cesnovar
Member

rok-cesnovar commented Sep 21, 2021

The minimal example I can make is this:

transformed data {
  int<lower=0> N = 5;
  vector[N] obsXobs_shift0_w = [-0.67082,0.5,-0.223607,-0.223607,-0.5]';
  int obsXobs_shift0_v[N]  = {1,2,3,1,2};
  int obsXobs_shift0_u[N+1] = {1,2,3,4,5,6};
}

parameters {
  vector[N] obs_shift0;
}

model {
    vector[N] obs_repl_shift_unscaled;
    obs_repl_shift_unscaled = csr_matrix_times_vector(N, N, obsXobs_shift0_w, obsXobs_shift0_v, obsXobs_shift0_u, obs_shift0);
    obs_repl_shift_unscaled ~ std_normal();
}

you can also replace the model part with:

model {
    target += sum(csr_matrix_times_vector(N, N, obsXobs_shift0_w, obsXobs_shift0_v, obsXobs_shift0_u, obs_shift0));
}

I am fairly certain the issue is in csr_matrix_times_vector and reverting stan-dev/math#2462 seems to fix the issue (cmdstan on develop, math on the branch revert/2642).
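
To make the minimal example easier to read, here is a small Python sketch that expands its CSR triple `(w, v, u)` into a dense matrix, assuming Stan's documented convention that `v` holds one-based column indices and `u` holds one-based row-start positions into `w`. The helper name `csr_to_dense` is made up for illustration. Since the second model block uses `target += sum(A * obs_shift0)`, the correct gradient with respect to `obs_shift0` is just the column sums of `A`.

```python
def csr_to_dense(m, n, w, v, u):
    # Expand a one-based CSR representation into a dense m x n list of lists.
    dense = [[0.0] * n for _ in range(m)]
    for row in range(m):
        # u[row] .. u[row + 1] - 1 are one-based positions in w for this row.
        for k in range(u[row] - 1, u[row + 1] - 1):
            dense[row][v[k] - 1] += w[k]
    return dense


# Values copied from the minimal example in the thread.
N = 5
w = [-0.67082, 0.5, -0.223607, -0.223607, -0.5]
v = [1, 2, 3, 1, 2]
u = [1, 2, 3, 4, 5, 6]

A = csr_to_dense(N, N, w, v, u)

# Gradient of sum(A @ b) with respect to b: the column sums of A.
col_sums = [sum(A[row][col] for row in range(N)) for col in range(N)]

for row in A:
    print(row)
print(col_sums)
```

Each row holds exactly one non-zero, in columns 1, 2, 3, 1, 2 respectively, so the last two columns are empty and their gradient entries should be exactly zero.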

@SteveBronder I don't have enough knowledge of the csr_* parts of the code. It would be great if you could replicate my findings. I would prefer finding a fix over a simple revert though.

@wds15
Contributor

wds15 commented Sep 21, 2021

I am still having trouble with stanc3 master as of now. Maybe this is a Mac thing? I will wait for the RC 2.28 and try again.

Really cool that @rok-cesnovar found the culprit of this!

@wds15
Contributor

wds15 commented Sep 21, 2021

ARGH...

[15:40:34][weberse2@C02XK2AGJHD2:~/work/cmdstan]$ make examples/bernoulli/bernoulli

--- Translating Stan model to C++ code ---
bin/stanc  --o=examples/bernoulli/bernoulli.hpp examples/bernoulli/bernoulli.stan
bin/stanc: line 1: syntax error near unexpected token `newline'
bin/stanc: line 1: `<?xml version="1.0" encoding="UTF-8"?>'
make: *** [examples/bernoulli/bernoulli.hpp] Error 2

... so not even the Bernoulli example works for me! Let's wait for the RC...

@WardBrian
Member

@wds15 can you try downloading the latest nightly for your system (https://github.com/stan-dev/stanc3/releases/tag/nightly) and running it directly (outside cmdstan)?

@wds15
Contributor

wds15 commented Sep 21, 2021

Now it runs ok when I download as you suggest... and it shows that develop is affected by the bug as well.

@SteveBronder
Collaborator

@rok-cesnovar yes, I'm seeing a bug in the test for csr_matrix_times_vector. I think I can put up a patch today. Should we do a backwards patch for 2.27 fixing this in math?

@rok-cesnovar
Member

Thanks!

Should we do a backwards patch for 2.27 fixing this in math?

I think fixing it for 2.28 is fine. At least that is how we handled things in the past with similar problems. We will prominently list this bugfix in the release notes though.

@SteveBronder
Collaborator

Just posted the patch above, which should fix this. @alyst sorry for the trouble! The issue was that in the 2.27 version I was treating the non-zero indices passed by users as starting from zero rather than from 1 (which is rather egregious; I'm surprised we did not catch that). The patch above adds some tests to make sure the indexing is all correct.
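
The effect of such an index-base mismatch can be sketched in a few lines of Python. This is a hypothetical illustration, not the Stan Math code and not the actual patch: `matvec_one_based` reads `v` and `u` as one-based (Stan's convention), while `matvec_misread_as_zero_based` deliberately reads the same arrays as if they were zero-based. The bounds guards in the second function are an assumption added so the sketch runs without raising an error.

```python
def matvec_one_based(w, v, u, b):
    # Correct reading: v and u hold one-based indices.
    m = len(u) - 1
    out = [0.0] * m
    for row in range(m):
        for k in range(u[row] - 1, u[row + 1] - 1):
            out[row] += w[k] * b[v[k] - 1]
    return out


def matvec_misread_as_zero_based(w, v, u, b):
    # Deliberately wrong reading: the same indices used as if zero-based.
    # Out-of-range positions are skipped so the sketch does not crash.
    m = len(u) - 1
    out = [0.0] * m
    for row in range(m):
        for k in range(u[row], min(u[row + 1], len(w))):
            col = v[k]
            if col < len(b):
                out[row] += w[k] * b[col]
    return out


# CSR data from the minimal example in the thread; b is an arbitrary test vector.
w = [-0.67082, 0.5, -0.223607, -0.223607, -0.5]
v = [1, 2, 3, 1, 2]
u = [1, 2, 3, 4, 5, 6]
b = [1.0, 2.0, 3.0, 4.0, 5.0]

correct = matvec_one_based(w, v, u, b)
misread = matvec_misread_as_zero_based(w, v, u, b)
print(correct)
print(misread)
```

With the misread indices every row picks up the wrong value and the wrong column, and the last row gets no entries at all. That is the sort of silent corruption that would leave sampling running while producing wrong values and gradients.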
