Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

plots for model predictions without y variable #151

Closed
wds15 opened this issue May 14, 2018 · 23 comments · Fixed by #222
Closed

plots for model predictions without y variable #151

wds15 opened this issue May 14, 2018 · 23 comments · Fixed by #222

Comments

@wds15
Copy link

wds15 commented May 14, 2018

ppc_* are great functions for doing model checking on the data fitted. However, it would be great to also allow these function to visualize predictions in absence of any data.

Possibilites to achieve this would be

  • make the y argument optional in these functions
  • expand the mcmc_* functions with things like the x argument part of a number of ppc function (as we usually do predictions as a function of some variable x)

Generally, the x argument in the ppc functions is great and it could be available in more places.

@jgabry
Copy link
Member

jgabry commented May 14, 2018

I’ve been wanting to do this for a while. Thanks for opening the issue. If y is not included then the prefix ppc_ isn’t quite appropriate anymore. pp_ I think makes more sense. @tjmahr maybe we could have pp_ versions that don’t require y and then have the ppc_ functions be wrappers for those that also require y and have additional y-specific arguments when necessary? It would be easier to just not require y in the existing functions, but it’s really not a ppc anymore in that case, it’s just a visualization of the posterior predictive distribution.

@wds15
Copy link
Author

wds15 commented May 15, 2018

Introducing pp_ functions sounds like a clean way of implementing it... at the cost of having more exposed functions. This is a decision maintainers need to make, but as a user that would make me very happy.

jgabry added a commit that referenced this issue May 8, 2019
@jgabry
Copy link
Member

jgabry commented May 22, 2019

This is the next major thing I want to work on getting into bayesplot. I think the prefix ppd_ makes sense for plots just of the posterior (or prior) predictive distribution.

The ppd functions will share most of the same code as the existing ppc functions so both should rely on the same internals wherever possible.

@jgabry jgabry changed the title extend ppc functions to plots for model predictions plots for model predictions without y variable May 22, 2019
@jgabry
Copy link
Member

jgabry commented May 23, 2019

@tjmahr I can work on the implementations of these, but just want to check that you’re ok with the plan (or if you have suggestions or a totally different proposal). Here’s an example function signature for these ppd functions:

ppd_dens_overlay(ypred, ..., same_args)
  • prefix is ppd instead of ppc
  • argument name is ypred
  • same_args won’t actually be there, we’ll have the same arguments as the corresponding ppc function but without the y and yrep

@tjmahr
Copy link
Collaborator

tjmahr commented May 24, 2019

So, here's the implementation of ppc_dens_overlay()

  check_ignored_arguments(...)
  data <- ppc_data(y, yrep)

  ggplot(data) +
    aes_(x = ~ value) +
    stat_density(
      aes_(group = ~ rep_id, color = "yrep"),
      data = function(x) dplyr::filter(x, !.data$is_y),
      geom = "line",
      position = "identity",
      size = size,
      alpha = alpha,
      trim = trim,
      bw = bw,
      adjust = adjust,
      kernel = kernel,
      n = n_dens
    ) +
    stat_density(
      aes_(color = "y"),
      data = function(x) dplyr::filter(x, .data$is_y),
      geom = "line",
      position = "identity",
      lineend = "round",
      size = 1,
      trim = trim,
      bw = bw,
      adjust = adjust,
      kernel = kernel,
      n = n_dens
    ) +
    scale_color_ppc_dist() +
    bayesplot_theme_get() +
    xlab(y_label()) +
    dont_expand_axes() +
    yaxis_title(FALSE) +
    xaxis_title(FALSE) +
    yaxis_text(FALSE) +
    yaxis_ticks(FALSE)

To implement this, we want to

  • have a ppd_data() function create the tidy dataframe
  • remove the second stat_density() layer
  • change the colors and labels to use something like y_pred instead of y_rep.

Is that basically what you had in mind? Following the format of the other functions?

I don't want us to be too clever and have functions that switch between ppd_ and ppc_ styles based on whether y is NULL. I know the temptation is there but I worry it make things harder to change down the road.

@jgabry
Copy link
Member

jgabry commented May 24, 2019

Yeah basically just what you said is what I was planning.

But something to consider: the existing ppc_ functions can be thought of as just modifications to a ppd_ plot, adding stuff about y (and changing ypred label to yrep). So at least in theory it would make sense to have, for example, ppc_dens_overlay() first call ppd_dens_overlay() and then add the y stuff to the plot.

So conceptually something like the following would be really elegant:

# pseudocode (we don't actually have overlaid_densities() and add_y_density())
ppd_dens_overlay <- function(ypred, ...) {
   data <- ppd_data(ypred, ...)
   ggplot(data, ...) + 
     overlaid_densities(...)
}
ppc_dens_overlay <- function (y, yrep, ...) {
  ppd_dens_overlay(ypred = yrep, ...) + 
    add_y_density(y)
}

But it could be that it makes more practical sense to do this (again this is some hybrid of real and pseudocode for the moment):

# don't call ppd_dens_overlay, just call the same internal overlaid_densities() helper
ppc_dens_overlay <- function(y, yrep, ...) {
  data <- ppc_data(y, yrep, ...)
  ggplot(filter(data, !is_y)) + 
     overlaid_densities(...) + 
     add_y_density(filter(data, is_y))
}

@jgabry
Copy link
Member

jgabry commented May 24, 2019

I don't want us to be too clever and have functions that switch between ppd_ and ppc_ styles based on whether y is NULL. I know the temptation is there but I worry it make things harder to change down the road.

I think I totally agree with this and my goal is definitely to do this in a way that makes it as easy as possible to maintain everything going forward, although that may require some non-trivial refactoring.

But just so I'm clear on what you mean, what would be the issue with something like the following?

# exported
ppd_data <- function(ypred, ...) {
  .ppd_data(predictions = ypred, observations = NULL, ...)
}
# exported
ppc_data <- function(y, yrep, ...) {
  .ppd_data(predictions = yrep, observations = y, ...)
}

# internal
.ppd_data <- function(predictions, observations = NULL, ...) {
  # if observations is NULL don't include `y` stuff in the returned object
}

@tjmahr
Copy link
Collaborator

tjmahr commented May 24, 2019

That seems fine at the data level. It's problem of creating six-different plots from an internal plotting function (like .mcmc_trace()... I don't know if it's really six. Bear with me.) that I worry about. All of those plots are really tightly coupled and it's hard to work through. In comparison, it's really clear what does what in the ppc_dens_overlay() code. I think using helper functions, scales and geoms can help achieve consistency without over-coupling.

@jgabry
Copy link
Member

jgabry commented May 24, 2019

Ok I see what you mean, thanks. I agree about avoiding things like what .mcmc_trace currently does. It’s probably a good idea to refactor and break that apart at some point to make it easier to work with.

@jgabry jgabry added this to In progress in PPD module May 25, 2019
jgabry added a commit that referenced this issue May 25, 2019
@jgabry
Copy link
Member

jgabry commented May 26, 2019

The internal .ppc_intervals is similar to .mcmc_trace in that it responsible for way too many plots. I’m going to get rid of it as part of this process. I’ve started breaking it up into helper functions that can be shared by the relevant individual ppc and ppd functions.

@jgabry
Copy link
Member

jgabry commented May 28, 2019

After looking into it, for some ppc functions it would work to call the ppd function inside the ppc function and then add the y stuff, but for others it’s either not possible or would require starting the implementation from scratch. For
consistency,I will just rely on shared helper functions and never call ppd inside a ppc.

However, I realized there’s a different opportunity for reducing code duplication. Without too much work, for both ppc and ppd plots we can have the the _grouped functions call their ungrouped counterpart and then add the facet layer.

ppc_ribbon_grouped <- function(y, yrep, x, group, ... facet_args, whatever...) {
call <- ungroup_call(match.call(expand.dots=FALSE)
eval(call) + add_facet_layer(facet_args)
}

So basically each _grouped function can be reduced to a few lines of code, which seems preferable to copying everything when the only difference is adding facets. The only change to the ungrouped function would be to have it allow the group argument via ... so it can create the grouped data frame when called internally by
the _grouped function.

It kind of sounds complicated in words but the code is really clean and simple.

Anyway, not saying we have to do it this way, just exploring the possibility.

——

ungroup_call() is a small helper function:

ungroup_call <- function(call) {

@jgabry jgabry closed this as completed May 28, 2019
@jgabry
Copy link
Member

jgabry commented May 28, 2019

Didn’t mean to close this. Reopening.

@jgabry jgabry reopened this May 28, 2019
@tjmahr
Copy link
Collaborator

tjmahr commented May 29, 2019

ungroup_call <- function(fn, call) {
  args <- rlang::call_args(call)
  args$called_from_internal <- TRUE
  args$... <- NULL
  rlang::call2(.fn = fn, !!!args, .ns = "bayesplot")
}

This is pretty tricky stuff. Not a judgment, just an observation.

@jgabry
Copy link
Member

jgabry commented May 29, 2019

Do you mean constructing calls in general is tricky or something particular about this use case? I'm not attached to doing it this way, just trying it out. We can drop the ungroup_call stuff and just to do something like this, which is much clearer:

ppc_intervals_grouped <- function(y, yrep, x = NULL, group = NULL, ..., facet_args = list(), other args) {
  g <- ppc_intervals(y, yrep, x, group, called_from_internal = TRUE)
  g + intervals_group_layer(facet_args)
}

That still requires ppc_intervals() to know it's being called from ppc_intervals_grouped() but avoids creating a call object. How do you feel about that?

@tjmahr
Copy link
Collaborator

tjmahr commented May 29, 2019

Hmm, I take it back.

My first hunch would be something like this,

f <- function(x, group, ...) {
  g(x, ...)
}

g <- function(x, ...) {
  print(x)
}

f(1, "a", "things") 

But that requires passing along all the arguments (except group) to g() which is tedious and gives you something else to update if the arguments to f() or g() change. Your way seems like the smarter, more automatic way to go about the problem.

@jgabry
Copy link
Member

jgabry commented May 29, 2019

Ok I’ll proceed that way for now

@jgabry
Copy link
Member

jgabry commented May 29, 2019

On another note, something else I’m doing as a part of this (because it makes sense to do at the same time as the ppd stuff) is adding all the ppc_*_data() functions that we are still missing.

@bnicenboim
Copy link
Contributor

Hi, I also wanted to make new predictions, or prior predictive checks with the ppc_ functions, and I got stuck having to have y that are not NAs.
One super simple solution was to just change the internal function validate_y so that it accepts NA. Maybe I'm being too naive, but why is this a problem?

@tjmahr
Copy link
Collaborator

tjmahr commented Dec 12, 2019

@jgabry do you need help for this on the https://github.com/stan-dev/bayesplot/compare/ppd-functions branch?

@jgabry
Copy link
Member

jgabry commented Dec 20, 2019

I need to get back to working on this! @tjmahr If you have time and want to help that branch move forwards that would be great!

@jgabry
Copy link
Member

jgabry commented Jun 16, 2020

@tjmahr So I'm finally getting back to working on this! I just fixed a bunch of merge conflicts and cleaned up some stuff. Currently this branch has added all the PPD versions for the "distributions", "intervals", and "test-statistics" categories (and also a bunch of missing _data() functions). I wonder if it makes sense for us to try to get this merged soon, and then the remaining PPD categories can be added after. What do you think?

@tjmahr
Copy link
Collaborator

tjmahr commented Jun 16, 2020

Sounds good.

@jgabry
Copy link
Member

jgabry commented Jun 16, 2020

Ok I'll get a PR ready. It's going to be a massive one (sorry!)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
No open projects
PPD module
  
In progress
Development

Successfully merging a pull request may close this issue.

4 participants