
Add convergence warnings #77

Open
paul-buerkner opened this issue Jun 19, 2020 · 33 comments
Labels
feature New feature or request

Comments

@paul-buerkner
Collaborator

The current convergence warnings in our packages (e.g., in rstan and brms) are inconsistent and partly confusing to users. Examples are given in https://discourse.mc-stan.org/t/improve-warnings-for-low-ess/15355. Other examples include warnings triggered by rhat or ess being NA, which is most likely caused by constant parameters; this does not pose a problem if the parameters are supposed to be constant (e.g., the correlation of a variable with itself).

We should move all of these warning messages to posterior, make them consistent, and improve their informativeness by mentioning the different reasons for the potential problems and pointing to more detailed documentation.

@jgabry
Member

jgabry commented Jun 19, 2020

Yeah, at least for the rhat and ess warnings this makes sense. But presumably we wouldn't include the warnings about divergences, treedepth, and e-bfmi given that they are Stan-specific? Or did you want to broaden the scope of posterior to summarize sampler diagnostics?

@paul-buerkner
Collaborator Author

paul-buerkner commented Jun 19, 2020 via email

@avehtari
Collaborator

But presumably we wouldn't include the warnings about divergences, treedepth, and e-bfmi given that they are Stan-specific?

They are not Stan-specific, but NUTS / dynamic HMC algorithm-specific. It's a good question whether the posterior package would be the place for algorithm-specific diagnostics. There can be other inference packages that use NUTS / dynamic HMC, too.

How is bayesplot going forward with posterior? bayesplot has algorithm-specific diagnostic plots, but is bayesplot going to use posterior inside? If so, would it then be natural to allow (some) posterior object to carry algorithm-specific information, too? Or would it be a Stan-specific object inherited from posterior with additional Stan-specific parts?

The algorithm-specific diagnostics should be improved, too. Again, there is the question of where the best place would be for advice, e.g., on what to do if any divergences are observed.

@jgabry
Member

jgabry commented Jun 22, 2020

They are not Stan-specific, but NUTS / dynamic HMC algorithm-specific.

Yeah good point.

How is bayesplot going forward with posterior? bayesplot has algorithm-specific diagnostic plots, but is bayesplot going to use posterior inside? If so, would it then be natural to allow (some) posterior object to carry algorithm-specific information, too? Or would it be a Stan-specific object inherited from posterior with additional Stan-specific parts?

Yeah I'd like to use posterior inside bayesplot, so I agree it would definitely be nice if posterior handled some algorithm-specific information. I think it can do this without needing to know about Stan. Taking CmdStanR as an example, both fit$draws() and fit$sampler_diagnostics() return posterior draws_array objects:

library(cmdstanr)
fit <- cmdstanr_example("schools")
str(fit$draws())
 'draws_array' num [1:1000, 1:4, 1:11] -45.5 -45.5 -44.8 -50.1 -46.9 ...
 - attr(*, "dimnames")=List of 3
  ..$ iteration: chr [1:1000] "1" "2" "3" "4" ...
  ..$ chain    : chr [1:4] "1" "2" "3" "4"
  ..$ variable : chr [1:11] "lp__" "mu" "tau" "theta[1]" ...

str(fit$sampler_diagnostics())
 'draws_array' num [1:1000, 1:4, 1:6] 0.6617 0.1201 0.3489 0.0429 1 ...
 - attr(*, "dimnames")=List of 3
  ..$ iteration: chr [1:1000] "1" "2" "3" "4" ...
  ..$ chain    : chr [1:4] "1" "2" "3" "4"
  ..$ variable : chr [1:6] "accept_stat__" "stepsize__" "treedepth__" "n_leapfrog__" ...

so posterior would just need to know how to handle draws objects containing variables like divergent__ and treedepth__, which other software besides Stan could also provide to posterior in draws format.

So maybe posterior just needs separate functions for diagnosing general MCMC and diagnosing particular algorithms, e.g.

fit$draws() %>% posterior::diagnose_mcmc()
fit$sampler_diagnostics() %>% posterior::diagnose_nuts()

which then could be wrapped up into a single method inside interfaces if desired:

fit$diagnose() # internally does the two lines above
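A minimal sketch of what the proposed split could look like. Note that diagnose_mcmc() and diagnose_nuts() are hypothetical names from this brainstorm, not an actual posterior API; only summarise_draws(), rhat(), ess_bulk(), and ess_tail() exist in posterior, and the thresholds used are illustrative.

```r
library(posterior)

# Hypothetical diagnose_mcmc(): generic MCMC diagnostics that any
# sampler's draws can be checked with, warning on common thresholds.
diagnose_mcmc <- function(draws) {
  s <- summarise_draws(draws, rhat, ess_bulk, ess_tail)
  if (any(s$rhat > 1.01, na.rm = TRUE))
    warning("Some rhat values exceed 1.01; the chains may not have mixed.")
  if (any(s$ess_bulk < 100, na.rm = TRUE))
    warning("Some bulk-ESS values are low; estimates may be unreliable.")
  s
}

# Hypothetical diagnose_nuts(): operates on a draws object of
# NUTS sampler diagnostics (variable names as produced by Stan).
diagnose_nuts <- function(sampler_diagnostics) {
  d <- as_draws_df(sampler_diagnostics)
  n_div <- sum(d$divergent__)
  if (n_div > 0)
    warning(n_div, " divergent transitions detected.")
  invisible(n_div)
}
```

With a CmdStanR fit these would be called as `fit$draws() %>% diagnose_mcmc()` and `fit$sampler_diagnostics() %>% diagnose_nuts()`.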

Anyway, just brainstorming here. What do you all think?

The algorithm-specific diagnostics should be improved, too. Again, there is the question of where the best place would be for advice, e.g., on what to do if any divergences are observed.

I agree!

@paul-buerkner
Collaborator Author

I agree. We should indeed handle those in posterior. One question I think we should discuss is how to name those columns. I know Stan uses *__ for protected variables whereas posterior uses .*. My question is whether we should align these naming conventions, at least within posterior itself.

@jgabry
Member

jgabry commented Jun 22, 2020

Yeah, what if posterior allows both as input, but always uses .* internally?
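That accept-both-normalize-internally idea could be sketched with a small helper. This is purely illustrative (normalize_reserved_names() is not a real posterior function); it just maps Stan-style trailing-underscore names onto posterior-style dot-prefixed names.

```r
# Hypothetical helper: accept Stan-style "name__" reserved variables
# as input, but normalize them to posterior-style ".name" internally.
normalize_reserved_names <- function(names) {
  stan_style <- grepl("__$", names)
  names[stan_style] <- paste0(".", sub("__$", "", names[stan_style]))
  names
}

normalize_reserved_names(c("divergent__", "treedepth__", "mu"))
# -> ".divergent" ".treedepth" "mu"
```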

@jgabry
Member

jgabry commented Sep 28, 2020

@paul-buerkner @avehtari Any further thoughts on this topic? I've been meaning to think more about this but haven't really had the time yet unfortunately. Should we try to find a time to discuss this in more detail? For better or worse, this is the kind of functionality we should try to get (close to) right the first time because the diagnostic warnings will potentially affect a lot of users/packages.

@avehtari
Collaborator

We had a very good discussion about this with @paul-buerkner last week, and I hope he made notes afterwards. We could have a video call after Wednesday.

@jgabry
Member

jgabry commented Sep 28, 2020

Ok great, sounds good

@paul-buerkner
Collaborator Author

paul-buerkner commented Sep 28, 2020 via email

@paul-buerkner
Collaborator Author

Aki and I discussed how we can make the warning system modular to achieve the following things:

  • Activate and deactivate warnings selectively and globally (via options)
  • Change the thresholds after which warnings are issued selectively and globally

This should allow both users and package developers building on top of posterior to adjust the warning system to their needs. We could think of a function creating a special S3 object class which controls all the warnings and thresholds. This could work a little like theme_set(theme(...)) in ggplot2, where we adjust warning info and then set it globally.
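One way the theme_set()-style idea could look, as a rough sketch. None of these functions or option names exist in posterior; they are stand-ins for the design being discussed, and the default thresholds are illustrative.

```r
# Hypothetical warning "theme" object, analogous to ggplot2::theme_set().
warning_theme <- function(rhat = 1.01, ess_bulk = 100, ess_tail = 100,
                          active = c("rhat", "ess_bulk", "ess_tail")) {
  structure(
    list(
      thresholds = list(rhat = rhat, ess_bulk = ess_bulk, ess_tail = ess_tail),
      active = active  # which warnings are enabled at all
    ),
    class = "posterior_warning_theme"
  )
}

# Setting the theme globally via options() lets both end users and
# packages built on posterior adjust thresholds or deactivate warnings.
warning_theme_set <- function(theme) {
  old <- getOption("posterior.warning_theme")
  options(posterior.warning_theme = theme)
  invisible(old)
}

# Example: loosen the rhat threshold and keep only the rhat warning.
warning_theme_set(warning_theme(rhat = 1.05, active = "rhat"))
```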

There are a few questions remaining:

  • Where should the warnings themselves live? I see two options, but there may be more.
  1. They can be in the convergence functions themselves (rhat, bulk_ess, etc.), but then it becomes hard to control warnings so that they are not printed for every parameter when a lot of parameters are checked.
  2. In summarize_draws, but then we have to develop a framework to detect convergence functions among all passed summary functions and handle their output correctly. I feel that this is probably the better option, but I may be overlooking something.
  • What level of control should we be giving users and developers over the content of the warnings?

  • Should there only be an option to adjust warnings globally, or should the warning 'theme' object also be passable to, say, summarize_draws? From a development perspective, explicitly passing the warning object, at least as one option, seems like the preferable approach to me.

@avehtari feel free to add more if you feel I have missed anything.

@jgabry
Member

jgabry commented Sep 29, 2020

Cool, thanks for sharing the notes.

Did you talk at all about whether or not to include HMC/NUTS-specific diagnostics? I think it would be good to include those so that all convergence warnings can be handled the same. But if we include those then how does that change the question of where the warnings should live? Right now summarize_draws() doesn't know anything about those other diagnostics like divergences and treedepth. Would we expand the input to summarize_draws() to also take in these extra sampler diagnostics? Should there be two separate functions summarize_draws() and diagnose_draws(), where the former doesn't do diagnostics at all and the latter handles all diagnostics? I don't know, just throwing out some ideas.

  • What level of control should we be giving users and developers over the content of the warnings?

By control over content do you mean whether they should be able to turn off certain warnings but keep others? Or something different?

  • Should there only be an option to adjust warnings globally, or should the warning 'theme' object also be passable to, say, summarize_draws? From a development perspective, explicitly passing the warning object, at least as one option, seems like the preferable approach to me.

Yeah I like the idea of explicitly passing the warning object.

@paul-buerkner
Collaborator Author

I think that having two separate methods, summarize_draws() and diagnose_draws(), would be good. The main reason is that for HMC/NUTS diagnostics we have per-draw information, such as divergences, that is not related to any particular variable, while the evaluation in summarize_draws() is per variable. In other words, the natural interfaces would differ, which favours the usage of different methods.

The per-variable diagnostics like rhat etc. can nicely be handled in summarize_draws(), and I would argue it should stay that way. The main question is how to represent the HMC/NUTS diagnostics. I think that (a) we should store them in some (to-be) protected meta-columns, as we do with .log_weights, and (b) we need a specific method to evaluate them. This could be called diagnose_draws, but I think that name would be too general, given that other diagnostics would still be evaluated in summarize_draws(). Something like diagnose_hmc_draws() could be specific enough, I think.
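A rough sketch of the proposed diagnose_hmc_draws(), under the assumption that the sampler diagnostics arrive as a draws object with the variable names Stan produces (divergent__, treedepth__); the function name comes from the proposal above and does not exist in posterior.

```r
library(posterior)

# Hypothetical diagnose_hmc_draws(): per-draw HMC/NUTS checks that do
# not map onto any particular model variable, so they live outside
# summarize_draws().
diagnose_hmc_draws <- function(sampler_diagnostics, max_treedepth = 10) {
  d <- as_draws_df(sampler_diagnostics)
  n <- nrow(d)
  n_divergent <- sum(d$divergent__)
  n_max_treedepth <- sum(d$treedepth__ >= max_treedepth)
  if (n_divergent > 0)
    warning(n_divergent, " of ", n, " transitions ended with a divergence.")
  if (n_max_treedepth > 0)
    warning(n_max_treedepth, " of ", n,
            " transitions hit the maximum treedepth of ", max_treedepth, ".")
  list(n_divergent = n_divergent, n_max_treedepth = n_max_treedepth)
}
```

With CmdStanR this would be called as `fit$sampler_diagnostics() %>% diagnose_hmc_draws()`.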

In theory, we can control 3 things about the convergence warnings:

  • which warnings to show at all in principle?
  • what thresholds to use for the warnings which are in principle shown?
  • what the content of the warning messages should be?

I think we all agree that points 1 and 2 should be controlled via the warning object. The question is whether we should give users or developers control over point 3 (warning message content). Right now, I think we should start with some fixed warning text, but I at least wanted to mention it so that we can discuss it if necessary.

@avehtari
Collaborator

avehtari commented Oct 5, 2020

@paul-buerkner summarized the discussion well. A couple of additional comments:

The main reason is that for HMC/NUTS diagnostics, we have a per-draw info such as divergences etc. not related to any particular variable

This also holds for R*, which is a generic diagnostic.

  1. which warnings to show at all in principle?
  2. what thresholds to use for the warnings which are in principle shown?
  3. what the content of the warning messages should be?

Re 1: I was also thinking that we have a separate function for each diagnostic check (even if the diagnostic is pre-computed), so that package developers can pick which functions to call at which point. In addition, it's useful to have a function that runs a set of checks controlled via the warning object.

Re 2 and 3: I was thinking that in the code we use variables for the warning thresholds and warning texts, and that all thresholds and texts are defined in one configuration file. That way it is easy to see all of them in one place, and editing (or translating) them requires no search through the code files. This would also be useful because we might want to have both a very short warning text and a more verbose one.

In addition to short and longer warning texts, we might also consider, for example for Rhat, that a short warning just says some Rhat is too high, while additional information would give a loo khat-type table of how many values exceed different thresholds.

Also, for each diagnostic check / warning we would have a link to a web page with more information on how to interpret the diagnostic warnings, what the next steps might be, etc.
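The one-configuration-file idea could be sketched like this: thresholds and warning texts live together in a single list, so they can be reviewed, edited, or translated without touching the check code. All names here are illustrative, not an actual posterior API, and the placeholder <link> stands in for the per-diagnostic web pages mentioned above.

```r
# Hypothetical central configuration: one entry per diagnostic, holding
# the threshold plus short and verbose warning texts side by side.
warning_config <- list(
  rhat = list(
    threshold = 1.01,
    short = "Some rhat values exceed %.2f.",
    long = paste("Some rhat values exceed %.2f, indicating the chains",
                 "have not mixed. See <link> for recommendations.")
  ),
  ess_bulk = list(
    threshold = 100,
    short = "Some bulk-ESS values are below %d.",
    long = "Some bulk-ESS values are below %d per chain. See <link>."
  )
)

# A check function then only looks up its entry, so all thresholds and
# texts stay editable in one place.
check_rhat <- function(rhat_values, config = warning_config$rhat) {
  if (any(rhat_values > config$threshold, na.rm = TRUE))
    warning(sprintf(config$short, config$threshold))
  invisible(rhat_values)
}
```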

@paul-buerkner
Collaborator Author

Just a quick note from our (@jgabry, @andrewgelman, @avehtari) call today. We decided that it might be a good thing to have all detailed recommendations that follow warnings (e.g., for divergent transitions, rhat, etc.) on web pages rather than in the warning messages themselves; the messages would then just point to the web pages. This way, the warning interface becomes much simpler and we can update our recommendations more easily without going through the CRAN submission process.

@mike-lawrence

Landed here after looking for existing issues requesting a diagnose draws feature, so commenting to lend support to diagnose_hmc_draws()

@jsocolar
Contributor

In thinking about the content of the output of diagnose_hmc_draws(), the broad outlines of what to report about divergences and E-BFMI seem pretty straightforward, but treedepth less so.

  • Stan's treedepth warnings are specific to however the user set max_treedepth.
  • If a Stan model passes all other diagnostics but hits the max_treedepth, current Stan docs suggest that this isn't necessarily a concern. And this will be especially true if the user specifies a small max_treedepth.
  • If the user specifies a huge max_treedepth and a huge treedepth is reached, maybe it would be a good idea to warn even if the largest realized treedepth is lower than max_treedepth. Say the user specifies max_treedepth = 20 and reaches treedepths of 19.
  • I'm not sure if it's even possible to tell from Stan CSV output whether max_treedepth was exceeded (i.e., the integration was halted prematurely) versus merely reached (i.e., the stopping criterion was triggered during the max_treedepth portion of the trajectory). Anybody wanna set me straight about this?

@mike-lawrence

I'm not sure if it's even possible to tell from Stan CSV output whether max_treedepth was exceeded (i.e., the integration was halted prematurely) versus merely reached (i.e., the stopping criterion was triggered during the max_treedepth portion of the trajectory). Anybody wanna set me straight about this?

I don't think that this info is in the standard CSV output, but maybe in the diagnostic files if those are produced?

@jsocolar
Contributor

maybe in the diagnostic files if those are produced

But surely the idea here isn't to parse existing diagnostic files in order to... run diagnostics :P

@mike-lawrence

mike-lawrence commented May 19, 2021 via email

@mike-lawrence

mike-lawrence commented May 19, 2021

I'm actually working on a workflow package that takes a different approach than is currently employed in cmdstanr and I'd already been considering generating and parsing the diagnostic CSVs by default, which would render the pertinent info available to post-sampling packages like posterior. (Though n.b. the eventual goal is to run checks during sampling to enable termination iff passed)

But another thought is: how does cmdstan's diagnose binary produce treedepth warnings without the diagnostic CSVs? Or am I misremembering that it does so?

@jgabry
Member

jgabry commented May 19, 2021

But another thought is: how does cmdstan's diagnose binary produce treedepth warnings without the diagnostic CSVs?

CmdStan's bin/diagnose doesn't need the diagnostic files, just the standard CSV output files. The CSV comments contain the max_treedepth option used (either specified by the user or the default).
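The same check can be reproduced from R using only the standard output: compare treedepth__ against the max_treedepth setting recorded in the CSV comments. This is a sketch; it assumes CmdStanR exposes that setting via fit$metadata() (field name assumed here), and note it measures the fraction of transitions *reaching* the limit, not exceeding it.

```r
library(cmdstanr)
library(posterior)

fit <- cmdstanr_example("schools")

# max_treedepth as recorded in the CSV comments (user-set or default);
# the metadata field name is an assumption about the cmdstanr API.
max_td <- fit$metadata()$max_treedepth

# Per-transition treedepths from the standard sampler diagnostics.
td <- extract_variable(fit$sampler_diagnostics(), "treedepth__")

# Fraction of transitions that reached (not necessarily exceeded) the limit.
frac_at_max <- mean(td >= max_td)
frac_at_max
```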

I'd already been considering generating and parsing the diagnostic CSVs by default

I think it's good to have that option but it shouldn't be the default (of course it's up to you since you're doing it in your own package!). The extra diagnostic files basically double the size of what you're writing to disk, which is fine for small models but risky as a default. The information in them is also only useful if you need info about the hmc trajectories, which isn't never but it's not the case for a typical user (it's not even accessible from most interfaces).

@jgabry
Member

jgabry commented May 19, 2021

which isn't never but it's not the case for a typical user

Although perhaps you're not intending your aria package for the "typical" user, more for power users like yourself?

@mike-lawrence

The CSV comments contain the max_treedepth option used

But as @jsocolar notes, couldn't you reach max treedepth without actually exceeding it? Or I guess the diagnostic is about % reaching not % exceeding?

@jgabry
Member

jgabry commented May 19, 2021

Yeah I think it's % reaching, but it's been a while since I checked so I could be wrong.

@mike-lawrence

mike-lawrence commented May 19, 2021

The extra diagnostic files basically double the size of what you're writing to disk

Yup, I know; I'm also capturing all info and writing to a proper compressed binary format (netcdf4 with the InferenceData spec), so the files will be deleted when cmdstan is done with them. I'm also looking into hacks that either redirect the writing or straight delete content between cmdstan's writes, but that's longer term.

@jgabry
Member

jgabry commented May 19, 2021

Interesting. I'll be curious to follow the progress!

@jsocolar
Contributor

Yeah I think it's % reaching, but it's been a while since I checked so I could be wrong.

Just confirmed this is true. Never realized this! One implication is that if a fraction even moderately less than 1 of transitions hit the default max_treedepth of 10, it is likely that very few or none of the transitions actually got terminated prematurely, contrary to what I think most users would infer from the existing warnings.

@mike-lawrence

mike-lawrence commented May 19, 2021

Wonder if we could model the step sizes with truncation to estimate the CDF and thereby % >= max? Anyone know if there's any theory on the expected distribution of step sizes for well-behaved chains?

@jsocolar
Contributor

For some of my models with a lot of parameters, treedepth has almost no variability, even if the step-size is right near the border between two treedepths.
Check this out:
https://discourse.mc-stan.org/t/issue-with-dual-averaging/5995/48

@jgabry
Member

jgabry commented May 19, 2021

One implication is that if a fraction even moderately less than 1 of transitions hit the default max_treedepth of 10, it is likely that very few or none of the transitions actually got terminated prematurely, contrary to what I think most users would infer from the existing warnings.

I think it could potentially be more than very few but yeah not necessarily, and I bet you're right that most users have the wrong impression because it's not explained well enough. That said, if you're constantly hitting max treedepth exactly then that's cutting it pretty close and also worth being warned about (but the warning should be clear about what it actually means).

The warning message for this in CmdStanR, for example, is

1000 of 4000 (25%) transitions hit the maximum treedepth limit of 10 or 2^10-1 leapfrog steps.
Trajectories that are prematurely terminated due to this limit will result in slow exploration.
...

So it does correctly say "hit the maximum treedepth limit", which doesn't imply it would have exceeded the limit if allowed. However, it does then mention the prematurely terminated trajectories. Even though it doesn't imply that that actually happened I could totally see it being confusing. If you have ideas for how we could express this more clearly that would be great.

Regardless, I think you're both correct that we'd need more info about the trajectories in order to distinguish between hitting max treedepth naturally and being capped at max treedepth artificially.

@jgabry
Member

jgabry commented May 19, 2021

For some of my models with a lot of parameters, treedepth has almost no variability, even if the step-size is right near the border between two treedepths.
Check this out:
https://discourse.mc-stan.org/t/issue-with-dual-averaging/5995/48

Great post. I especially like the plot of step_size vs accept_stat with treedepth as the color. Very nice. I wonder, would it be useful to add a plot like that to bayesplot?

@avehtari
Collaborator

I think it's not useful to store the information on whether max_treedepth was reached, but it would be useful to be able to easily access the number of leapfrog steps per iteration, as that is informative about the computational cost and can be compared to the expected number of leapfrog steps for a normal distribution with the optimal stepsize.

For some of my models with a lot of parameters, treedepth has almost no variability, even if the step-size is right near the border between two treedepths.

This hints that your posterior has almost constant curvature, that is, it's very close to normal.

@jgabry jgabry removed this from the CRAN release milestone Jun 24, 2021