
Add convergence warnings #77

Open
paul-buerkner opened this issue Jun 19, 2020 · 33 comments
Labels
feature New feature or request

Comments

@paul-buerkner
Collaborator

The current convergence warnings in our packages (e.g., in rstan and brms) are inconsistent and partly confusing to users. Examples are given in https://discourse.mc-stan.org/t/improve-warnings-for-low-ess/15355. Other examples include warnings triggered by rhat or ess being NA, which is most likely caused by constant parameters; this does not pose a problem if the parameters are supposed to be constant (e.g., the correlation of a variable with itself).

We should move all of these warning messages to posterior, make them consistent, and improve their informativeness by mentioning the different reasons for the potential problems and pointing to more detailed documentation.

@jgabry
Member

jgabry commented Jun 19, 2020

Yeah, at least for the rhat and ess warnings this makes sense. But presumably we wouldn't include the warnings about divergences, treedepth, and e-bfmi given that they are Stan-specific? Or did you want to broaden the scope of posterior to summarize sampler diagnostics?

@paul-buerkner
Collaborator Author

paul-buerkner commented Jun 19, 2020 via email

@avehtari
Collaborator

But presumably we wouldn't include the warnings about divergences, treedepth, and e-bfmi given that they are Stan-specific?

They are not Stan-specific, but NUTS / dynamic HMC algorithm-specific. It's a good question whether the posterior package would be the place for algorithm-specific diagnostics. There can be other inference packages that use NUTS / dynamic HMC, too.

How is bayesplot going forward with posterior? bayesplot has algorithm-specific diagnostic plots, but is bayesplot going to use posterior inside? If so, would it then be natural to allow (some) posterior object to carry algorithm-specific information, too? Or would it be a Stan-specific object inherited from posterior with additional Stan-specific parts?

The algorithm-specific diagnostics should be improved, too. Again, there is the question of where the best place would be for advice, e.g., on what to do if any divergences are observed.

@jgabry
Member

jgabry commented Jun 22, 2020

They are not Stan-specific, but NUTS / dynamic HMC algorithm-specific.

Yeah good point.

How is bayesplot going forward with posterior? bayesplot has algorithm-specific diagnostic plots, but is bayesplot going to use posterior inside? If so, would it then be natural to allow (some) posterior object to carry algorithm-specific information, too? Or would it be a Stan-specific object inherited from posterior with additional Stan-specific parts?

Yeah I'd like to use posterior inside bayesplot, so I agree it would definitely be nice if posterior handled some algorithm-specific information. I think it can do this without needing to know about Stan. Taking CmdStanR as an example, both fit$draws() and fit$sampler_diagnostics() return posterior draws_array objects:

library(cmdstanr)
fit <- cmdstanr_example("schools")
str(fit$draws())
 'draws_array' num [1:1000, 1:4, 1:11] -45.5 -45.5 -44.8 -50.1 -46.9 ...
 - attr(*, "dimnames")=List of 3
  ..$ iteration: chr [1:1000] "1" "2" "3" "4" ...
  ..$ chain    : chr [1:4] "1" "2" "3" "4"
  ..$ variable : chr [1:11] "lp__" "mu" "tau" "theta[1]" ...

str(fit$sampler_diagnostics())
 'draws_array' num [1:1000, 1:4, 1:6] 0.6617 0.1201 0.3489 0.0429 1 ...
 - attr(*, "dimnames")=List of 3
  ..$ iteration: chr [1:1000] "1" "2" "3" "4" ...
  ..$ chain    : chr [1:4] "1" "2" "3" "4"
  ..$ variable : chr [1:6] "accept_stat__" "stepsize__" "treedepth__" "n_leapfrog__" ...

so posterior would just need to know how to handle draws objects containing variables like divergent__ and treedepth__, which other software besides Stan could also provide to posterior in draws format.

So maybe posterior just needs separate functions for diagnosing general MCMC and diagnosing particular algorithms, e.g.

fit$draws() %>% posterior::diagnose_mcmc()
fit$sampler_diagnostics() %>% posterior::diagnose_nuts()

which then could be wrapped up into a single method inside interfaces if desired:

fit$diagnose() # internally does the two lines above
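A minimal sketch of what the proposed split could look like. Note that diagnose_mcmc() and diagnose_nuts() are hypothetical names from this brainstorm, not an actual posterior API; only summarise_draws(), rhat(), ess_bulk(), and ess_tail() exist in posterior, and the thresholds used are illustrative.

```r
library(posterior)

# Hypothetical diagnose_mcmc(): generic MCMC diagnostics that any
# sampler's draws can be checked with, warning on common thresholds.
diagnose_mcmc <- function(draws) {
  s <- summarise_draws(draws, rhat, ess_bulk, ess_tail)
  if (any(s$rhat > 1.01, na.rm = TRUE))
    warning("Some rhat values exceed 1.01; the chains may not have mixed.")
  if (any(s$ess_bulk < 100, na.rm = TRUE))
    warning("Some bulk-ESS values are low; estimates may be unreliable.")
  s
}

# Hypothetical diagnose_nuts(): operates on a draws object of
# NUTS sampler diagnostics (variable names as produced by Stan).
diagnose_nuts <- function(sampler_diagnostics) {
  d <- as_draws_df(sampler_diagnostics)
  n_div <- sum(d$divergent__)
  if (n_div > 0)
    warning(n_div, " divergent transitions detected.")
  invisible(n_div)
}
```

With a CmdStanR fit these would be called as `fit$draws() %>% diagnose_mcmc()` and `fit$sampler_diagnostics() %>% diagnose_nuts()`.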

Anyway, just brainstorming here. What do you all think?

The algorithm-specific diagnostics should be improved, too. Again, there is the question of where the best place would be for advice, e.g., on what to do if any divergences are observed.

I agree!

@paul-buerkner
Collaborator Author

I agree. We should indeed handle those in posterior. One question I think we should discuss is how to name those columns. I know Stan uses *__ for protected variables whereas posterior uses .*. My question is whether we should align these naming conventions, at least within posterior itself.

@jgabry
Member

jgabry commented Jun 22, 2020

Yeah, what if posterior allows both as input, but always uses .* internally?
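That accept-both-normalize-internally idea could be sketched with a small helper. This is purely illustrative (normalize_reserved_names() is not a real posterior function); it just maps Stan-style trailing-underscore names onto posterior-style dot-prefixed names.

```r
# Hypothetical helper: accept Stan-style "name__" reserved variables
# as input, but normalize them to posterior-style ".name" internally.
normalize_reserved_names <- function(names) {
  stan_style <- grepl("__$", names)
  names[stan_style] <- paste0(".", sub("__$", "", names[stan_style]))
  names
}

normalize_reserved_names(c("divergent__", "treedepth__", "mu"))
# -> ".divergent" ".treedepth" "mu"
```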

@jgabry
Member

jgabry commented Sep 28, 2020

@paul-buerkner @avehtari Any further thoughts on this topic? I've been meaning to think more about this but haven't really had the time yet unfortunately. Should we try to find a time to discuss this in more detail? For better or worse, this is the kind of functionality we should try to get (close to) right the first time because the diagnostic warnings will potentially affect a lot of users/packages.

@avehtari
Collaborator

We had a very good discussion about this with @paul-buerkner last week, and I hope he made notes afterwards. We could have a video call after Wednesday.

@jgabry
Member

jgabry commented Sep 28, 2020

Ok great, sounds good

@paul-buerkner
Collaborator Author

paul-buerkner commented Sep 28, 2020 via email

@paul-buerkner
Collaborator Author

Aki and I discussed how we can make the warning system modular to achieve the following things:

  • Activate and deactivate warnings selectively and globally (via options)
  • Change the thresholds after which warnings are issued selectively and globally

This should allow both users and package developers building on top of posterior to adjust the warning system to their needs. We could think of a function creating a special S3 object class which controls all the warnings and thresholds. This could work a little like theme_set(theme(...)) in ggplot2, where we adjust warning info and then set it globally.
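One way the theme_set()-style idea could look, as a rough sketch. None of these functions or option names exist in posterior; they are stand-ins for the design being discussed, and the default thresholds are illustrative.

```r
# Hypothetical warning "theme" object, analogous to ggplot2::theme_set().
warning_theme <- function(rhat = 1.01, ess_bulk = 100, ess_tail = 100,
                          active = c("rhat", "ess_bulk", "ess_tail")) {
  structure(
    list(
      thresholds = list(rhat = rhat, ess_bulk = ess_bulk, ess_tail = ess_tail),
      active = active  # which warnings are enabled at all
    ),
    class = "posterior_warning_theme"
  )
}

# Setting the theme globally via options() lets both end users and
# packages built on posterior adjust thresholds or deactivate warnings.
warning_theme_set <- function(theme) {
  old <- getOption("posterior.warning_theme")
  options(posterior.warning_theme = theme)
  invisible(old)
}

# Example: loosen the rhat threshold and keep only the rhat warning.
warning_theme_set(warning_theme(rhat = 1.05, active = "rhat"))
```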

There are a few questions remaining:

  • Where should the warnings themselves live? I see two options, but there may be more.
  1. They can be in the convergence functions themselves (rhat, bulk_ess, etc.), but then it becomes hard to control warnings so that they are not printed for every parameter when a lot of parameters are checked.
  2. In summarize_draws, but then we have to develop a framework to detect convergence functions among all passed summary functions and handle their output correctly. I feel that this is probably the better option, but I may be overlooking something.
  • What level of control should we be giving users and developers over the content of the warnings?

  • Should there only be an option to adjust warnings globally, or should the warning 'theme' object also be passable to, say, summarize_draws? From a development perspective, explicitly passing the warning object, at least as one option, seems like the preferable approach to me.

@avehtari feel free to add more if you feel I have missed anything.

@jgabry
Member

jgabry commented Sep 29, 2020

Cool, thanks for sharing the notes.

Did you talk at all about whether or not to include HMC/NUTS-specific diagnostics? I think it would be good to include those so that all convergence warnings can be handled the same. But if we include those then how does that change the question of where the warnings should live? Right now summarize_draws() doesn't know anything about those other diagnostics like divergences and treedepth. Would we expand the input to summarize_draws() to also take in these extra sampler diagnostics? Should there be two separate functions summarize_draws() and diagnose_draws(), where the former doesn't do diagnostics at all and the latter handles all diagnostics? I don't know, just throwing out some ideas.

  • What level of control should we be giving users and developers over the content of the warnings?

By control over content do you mean whether they should be able to turn off certain warnings but keep others? Or something different?

  • Should there only be an option to adjust warnings globally, or should the warning 'theme' object also be passable to, say, summarize_draws? From a development perspective, explicitly passing the warning object, at least as one option, seems like the preferable approach to me.

Yeah I like the idea of explicitly passing the warning object.

@paul-buerkner
Collaborator Author

I think that having two separate methods, summarize_draws() and diagnose_draws(), would be good. The main reason is that for HMC/NUTS diagnostics we have per-draw information, such as divergences, that is not related to any particular variable, while the evaluation in summarize_draws() is per variable. In other words, the natural interfaces would differ, which favours the usage of different methods.

The per-variable diagnostics like rhat etc. can nicely be handled in summarize_draws(), and I would argue it should stay that way. The main question is how to represent the HMC/NUTS diagnostics. I think that (a) we should store them in some (to-be) protected meta-columns, as we do with .log_weights, and (b) we need a specific method to evaluate them. This could be called diagnose_draws, but I think that name would be too general, given that other diagnostics would still be evaluated in summarize_draws(). Something like diagnose_hmc_draws() could be specific enough, I think.
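A rough sketch of the proposed diagnose_hmc_draws(), under the assumption that the sampler diagnostics arrive as a draws object with the variable names Stan produces (divergent__, treedepth__); the function name comes from the proposal above and does not exist in posterior.

```r
library(posterior)

# Hypothetical diagnose_hmc_draws(): per-draw HMC/NUTS checks that do
# not map onto any particular model variable, so they live outside
# summarize_draws().
diagnose_hmc_draws <- function(sampler_diagnostics, max_treedepth = 10) {
  d <- as_draws_df(sampler_diagnostics)
  n <- nrow(d)
  n_divergent <- sum(d$divergent__)
  n_max_treedepth <- sum(d$treedepth__ >= max_treedepth)
  if (n_divergent > 0)
    warning(n_divergent, " of ", n, " transitions ended with a divergence.")
  if (n_max_treedepth > 0)
    warning(n_max_treedepth, " of ", n,
            " transitions hit the maximum treedepth of ", max_treedepth, ".")
  list(n_divergent = n_divergent, n_max_treedepth = n_max_treedepth)
}
```

With CmdStanR this would be called as `fit$sampler_diagnostics() %>% diagnose_hmc_draws()`.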

In theory, we can control 3 things about the convergence warnings:

  • which warnings to show at all in principle?
  • what thresholds to use for the warnings which are in principle shown?
  • what the content of the warning messages should be?

I think we all agree that points 1 and 2 should be controlled via the warning object. The question is whether we should give users or developers control over point 3 (warning message content). Right now, I think we should start with some fixed warning text, but I at least wanted to mention it so that we can discuss it if necessary.

@avehtari
Collaborator

avehtari commented Oct 5, 2020

@paul-buerkner summarized the discussion well. A couple of additional comments:

The main reason is that for HMC/NUTS diagnostics, we have a per-draw info such as divergences etc. not related to any particular variable

This also holds for R*, which is a generic diagnostic.

  1. which warnings to show at all in principle?
  2. what thresholds to use for the warnings which are in principle shown?
  3. what the content of the warning messages should be?

Re 1: I was also thinking that we have a separate function for each diagnostic check (even if the diagnostic is pre-computed), so that package developers can pick which functions to call at which point. In addition, it's useful to have a function that runs a set of checks controlled via the warning object.

Re 2 and 3: I was thinking that in the code we use variables for the warning thresholds and warning texts, and that all thresholds and texts are defined in one configuration file. That way it is easy to see all of them in one place, and editing (or translating) them requires no search through the code files. This would also be useful because we might want to have both a very short warning text and a more verbose one.

In addition to short and longer warning texts, we might also consider, for example for Rhat, that a short warning just says some Rhat is too high, while additional information would give a loo khat-type table of how many values exceed different thresholds.

Also, for each diagnostic check / warning we would have a link to a web page with more information on how to interpret the diagnostic warnings, what the next steps might be, etc.
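The one-configuration-file idea could be sketched like this: thresholds and warning texts live together in a single list, so they can be reviewed, edited, or translated without touching the check code. All names here are illustrative, not an actual posterior API, and the placeholder <link> stands in for the per-diagnostic web pages mentioned above.

```r
# Hypothetical central configuration: one entry per diagnostic, holding
# the threshold plus short and verbose warning texts side by side.
warning_config <- list(
  rhat = list(
    threshold = 1.01,
    short = "Some rhat values exceed %.2f.",
    long = paste("Some rhat values exceed %.2f, indicating the chains",
                 "have not mixed. See <link> for recommendations.")
  ),
  ess_bulk = list(
    threshold = 100,
    short = "Some bulk-ESS values are below %d.",
    long = "Some bulk-ESS values are below %d per chain. See <link>."
  )
)

# A check function then only looks up its entry, so all thresholds and
# texts stay editable in one place.
check_rhat <- function(rhat_values, config = warning_config$rhat) {
  if (any(rhat_values > config$threshold, na.rm = TRUE))
    warning(sprintf(config$short, config$threshold))
  invisible(rhat_values)
}
```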

@paul-buerkner
Collaborator Author

Just a quick note from our (@jgabry, @andrewgelman, @avehtari) call today. We decided that it might be a good thing to have all detailed recommendations that follow warnings (e.g., for divergent transitions, rhat, etc.) on web pages rather than in the warning messages themselves; the messages would then just point to the web pages. This way, the warning interface becomes much simpler and we can update our recommendations more easily without going through the CRAN submission process.

@mike-lawrence

Landed here after looking for existing issues requesting a diagnose draws feature, so commenting to lend support to diagnose_hmc_draws()

@jsocolar
Contributor

In thinking about the content of the output of diagnose_hmc_draws(), the broad outlines of what to report about divergences and E-BFMI seem pretty straightforward, but treedepth less so.

  • Stan's treedepth warnings are specific to however the user set max_treedepth.
  • If a Stan model passes all other diagnostics but hits the max_treedepth, current Stan docs suggest that this isn't necessarily a concern. And this will be especially true if the user specifies a small max_treedepth.
  • If the user specifies a huge max_treedepth and a huge treedepth is reached, maybe it would be a good idea to warn even if the largest realized treedepth is lower than max_treedepth. Say the user specifies max_treedepth = 20 and reaches treedepths of 19.
  • I'm not sure if it's even possible to tell from Stan CSV output whether max_treedepth was exceeded (i.e., the integration was halted prematurely) versus merely reached (i.e., the stopping criterion was triggered during the max_treedepth portion of the trajectory). Anybody wanna set me straight about this?

@mike-lawrence

I'm not sure if it's even possible to tell from Stan CSV output whether max_treedepth was exceeded (i.e., the integration was halted prematurely) versus merely reached (i.e., the stopping criterion was triggered during the max_treedepth portion of the trajectory). Anybody wanna set me straight about this?

I don't think that this info is in the standard CSV output, but maybe in the diagnostic files if those are produced?

@jsocolar
Contributor

maybe in the diagnostic files if those are produced

But surely the idea here isn't to parse existing diagnostic files in order to... run diagnostics :P

@mike-lawrence

mike-lawrence commented May 19, 2021 via email

@mike-lawrence

mike-lawrence commented May 19, 2021

I'm actually working on a workflow package that takes a different approach than is currently employed in cmdstanr and I'd already been considering generating and parsing the diagnostic CSVs by default, which would render the pertinent info available to post-sampling packages like posterior. (Though n.b. the eventual goal is to run checks during sampling to enable termination iff passed)

But another thought is: how does cmdstan's diagnose binary produce treedepth warnings without the diagnostic CSVs? Or am I misremembering that it does so?

@jgabry
Member

jgabry commented May 19, 2021

But another thought is: how does cmdstan's diagnose binary produce treedepth warnings without the diagnostic CSVs?

CmdStan's bin/diagnose doesn't need the diagnostic files, just the standard CSV output files. The CSV comments contain the max_treedepth option used (either specified by the user or the default).
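The same check can be reproduced from R using only the standard output: compare treedepth__ against the max_treedepth setting recorded in the CSV comments. This is a sketch; it assumes CmdStanR exposes that setting via fit$metadata() (field name assumed here), and note it measures the fraction of transitions *reaching* the limit, not exceeding it.

```r
library(cmdstanr)
library(posterior)

fit <- cmdstanr_example("schools")

# max_treedepth as recorded in the CSV comments (user-set or default);
# the metadata field name is an assumption about the cmdstanr API.
max_td <- fit$metadata()$max_treedepth

# Per-transition treedepths from the standard sampler diagnostics.
td <- extract_variable(fit$sampler_diagnostics(), "treedepth__")

# Fraction of transitions that reached (not necessarily exceeded) the limit.
frac_at_max <- mean(td >= max_td)
frac_at_max
```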

I'd already been considering generating and parsing the diagnostic CSVs by default

I think it's good to have that option but it shouldn't be the default (of course it's up to you since you're doing it in your own package!). The extra diagnostic files basically double the size of what you're writing to disk, which is fine for small models but risky as a default. The information in them is also only useful if you need info about the hmc trajectories, which isn't never but it's not the case for a typical user (it's not even accessible from most interfaces).

@jgabry
Member

jgabry commented May 19, 2021

which isn't never but it's not the case for a typical user

Although perhaps you're not intending your aria package for the "typical" user, more for power users like yourself?

@mike-lawrence

The CSV comments contain the max_treedepth option used

But as @jsocolar notes, couldn't you reach max treedepth without actually exceeding it? Or I guess the diagnostic is about % reaching not % exceeding?

@jgabry
Member

jgabry commented May 19, 2021

Yeah I think it's % reaching, but it's been a while since I checked so I could be wrong.

@mike-lawrence

mike-lawrence commented May 19, 2021

The extra diagnostic files basically double the size of what you're writing to disk

Yup, I know; I'm also capturing all info and writing to a proper compressed binary format (netcdf4 with the InferenceData spec), so the files will be deleted when cmdstan is done with them. I'm also looking into hacks that either redirect the writing or straight delete content between cmdstan's writes, but that's longer term.

@jgabry
Member

jgabry commented May 19, 2021

Interesting. I'll be curious to follow the progress!

@jsocolar
Contributor

Yeah I think it's % reaching, but it's been a while since I checked so I could be wrong.

Just confirmed this is true. Never realized this! One implication is that if a fraction even moderately less than 1 of transitions hit the default max_treedepth of 10, it is likely that very few or none of the transitions actually got terminated prematurely, contrary to what I think most users would infer from the existing warnings.

@mike-lawrence

mike-lawrence commented May 19, 2021

Wonder if we could model the step sizes with truncation to estimate the CDF and thereby % >= max? Anyone know if there's any theory on the expected distribution of step sizes for well-behaved chains?

@jsocolar
Contributor

For some of my models with a lot of parameters, treedepth has almost no variability, even if the step-size is right near the border between two treedepths.
Check this out:
https://discourse.mc-stan.org/t/issue-with-dual-averaging/5995/48

@jgabry
Member

jgabry commented May 19, 2021

One implication is that if a fraction even moderately less than 1 of transitions hit the default max_treedepth of 10, it is likely that very few or none of the transitions actually got terminated prematurely, contrary to what I think most users would infer from the existing warnings.

I think it could potentially be more than very few but yeah not necessarily, and I bet you're right that most users have the wrong impression because it's not explained well enough. That said, if you're constantly hitting max treedepth exactly then that's cutting it pretty close and also worth being warned about (but the warning should be clear about what it actually means).

The warning message for this in CmdStanR, for example, is

1000 of 4000 (25%) transitions hit the maximum treedepth limit of 10 or 2^10-1 leapfrog steps.
Trajectories that are prematurely terminated due to this limit will result in slow exploration.
...

So it does correctly say "hit the maximum treedepth limit", which doesn't imply it would have exceeded the limit if allowed. However, it does then mention the prematurely terminated trajectories. Even though it doesn't imply that that actually happened I could totally see it being confusing. If you have ideas for how we could express this more clearly that would be great.

Regardless, I think you're both correct that we'd need more info about the trajectories in order to distinguish between hitting max treedepth naturally and being capped at max treedepth artificially.

@jgabry
Member

jgabry commented May 19, 2021

For some of my models with a lot of parameters, treedepth has almost no variability, even if the step-size is right near the border between two treedepths.
Check this out:
https://discourse.mc-stan.org/t/issue-with-dual-averaging/5995/48

Great post. I especially like the plot of step_size vs accept_stat with treedepth as the color. Very nice. I wonder, would it be useful to add a plot like that to bayesplot?

@avehtari
Collaborator

I think it's not useful to store the information on whether max_treedepth was reached, but it would be useful to be able to easily access the number of leapfrog steps per iteration, as that is informative about the computational cost and can be compared to the expected number of leapfrog steps for a normal distribution with the optimal stepsize.

For some of my models with a lot of parameters, treedepth has almost no variability, even if the step-size is right near the border between two treedepths.

This hints that your posterior has almost constant curvature, that is, it's very close to normal.

@jgabry jgabry removed this from the CRAN release milestone Jun 24, 2021