Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduce mcmc_parcoord #108

Merged
merged 10 commits into from
Sep 8, 2017
Merged

Introduce mcmc_parcoord #108

merged 10 commits into from
Sep 8, 2017

Conversation

jgabry
Copy link
Member

@jgabry jgabry commented Sep 5, 2017

(Resolves #107)

This PR introduces the mcmc_parcoord function for parallel coordinates plots of MCMC draws. It can also highlight divergences in the plot if NUTS parameter info is passed to the np argument. The resulting plot is basically the plot suggested by @ahartikainen:

http://discourse.mc-stan.org/t/concentration-of-divergences/1590/21

Example usage:

library(rstan)
fit <- stan_demo("eight_schools")
draws <- as.array(fit, pars = c("tau", "theta", "lp__"))
np <- nuts_params(fit)

# default (i.e., without showing divergences)
color_scheme_set("blue")
mcmc_parcoord(draws)

parcoord1

# including divergences
color_scheme_set("darkgray")
div_style <- parcoord_style_np(div_color = "green", div_size = 0.5, div_alpha = 0.1)
mcmc_parcoord(draws, size = 0.25, alpha = 0.1, np = np, np_style = div_style)

parcoord2

@@ -55,6 +55,7 @@ export(mcmc_nuts_energy)
export(mcmc_nuts_stepsize)
export(mcmc_nuts_treedepth)
export(mcmc_pairs)
export(mcmc_parcoord)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mcmc_parcoord_data function too?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes good point. I'll add that. In the case that the user specifies the nuts
parameters, do you think it makes sense to return a single data frame that
has a column indicating if there was a divergence, or a list of two data
frames already split?

I guess that's also a general question for implementing the _data
functions. It would be nice to always return a single data frame for all of
those functions (like what you already did for ppc_intervals/ribbon) but I
wonder if we'll run into cases going forward when that's not the best
option. An alternative would be to always return a list of data frames and
sometimes (usually) that list will only have a single data frame. What do
you think? Whatever we choose it would be good to do it the same way for
all of them.

When possible, return just a dataframe. If a plot requires multiple data-frames, return them in a list. Definitely agree that we should try to be as consistent as possible.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes good point. I'll add that. In the case that the user specifies the nuts
parameters, do you think it makes sense to return a single data frame that
has a column indicating if there was a divergence, or a list of two data
frames already split?

With an indicator column.

@jgabry
Copy link
Member Author

jgabry commented Sep 5, 2017 via email


has_divs <- !is.null(np)
if (has_divs) {
np <- validate_nuts_data_frame(np)
Copy link
Collaborator

@tjmahr tjmahr Sep 5, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Three points.

First, I recommend dplyr::pull(df, value) or getElement(df, "value) instead of .$value. I think the package check will be unhappy with the . symbol being an undefined global. If you use pull, then the namespace needs to be updated too.

Second, speaking of undefined globals, Travis has timed out during checks on this branch. Can you confirm a clean package check?

Third, here's how the function might look in the new tidy evaluation system, you can use sym() to create a symbol and then UQ() to unquote the variable containing the symbol. (Typically you would write !! x instead of UQ(x) but it's best to avoid that shortcut in logical expressions.) Learn more about it from these guides.

    if (has_divs) {
      param <- sym("Parameter")
      value <- sym("Value")
      divg <- sym("Divergent")
      
      draws$Divergent <- np %>% 
        filter(UQ(param) == "divergent__") %>%
        pull(UQ(value)) %>%
        rep(times = n_param)
      
      np <- validate_nuts_data_frame(np)
      div_draws <- filter(draws, UQ(divg) == 1)
      draws <- filter(draws, UQ(divg) == 0)
    }

This suggestion is not required. This is just a heads-up to check it out and try to get comfortable with it.

@jgabry
Copy link
Member Author

jgabry commented Sep 5, 2017

@tjmahr Thanks for the comments. Here's an update:

  • I added mcmc_parcoord_data. It returns a data frame with columns Draw (integer), Parameter (factor), Value (numeric), Divergent (numeric 0/1). I was going to name the Draw column Iteration but I didn't because all chains have already been merged so Iteration doesn't really make sense. (For other _data functions in the future it could make sense to return both Iteration and Chain instead of just Draw if the plotting function needs to keep chains separate.)

  • In the latest commits I use the new tidy eval syntax. Thanks for the reminder. I ended up changing how I was combining the parameter draws and the divergences so it's a bit different than what you suggested but I think it's correct (I'm still getting used to the new syntax so if you can take another look that'd be great).

  • I just ran R CMD CHECK locally and it passes.

@ahartikainen
Copy link

You could make an option for normalizing the data to mean 0 and sd 1 ( (x-mu)/sigma). It is needed if the two parameters are in different scales (e.g. normal(0,100), normal(150, 0.01))

@jgabry
Copy link
Member Author

jgabry commented Sep 6, 2017 via email

@jgabry
Copy link
Member Author

jgabry commented Sep 6, 2017

@ahartikainen I added an example of standardizing in 3279ebe. Thanks for the suggestion.

@jgabry jgabry mentioned this pull request Sep 7, 2017
7 tasks
Copy link
Collaborator

@tjmahr tjmahr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mcmc_melt is a great idea. So an "Iteration" is an iteration in a sampling chain. And a "Draw" is sample.

@jgabry
Copy link
Member Author

jgabry commented Sep 8, 2017

So an "Iteration" is an iteration in a sampling chain. And a "Draw" is sample.

Yeah, well at least that makes sense to me! If we have a sample size of 4000 from 4 stacked chains of 1000 then it seems strange to refer the the 1001st value as the 1001st iteration since it was actually the 1st iteration in one of the chains before the chains were stacked. So yeah I think it makes sense to just use Draw. We can use Iteration if there's also a Chain column.

@jgabry jgabry merged commit 5095d68 into master Sep 8, 2017
@jgabry jgabry deleted the mcmc_parcoord branch September 8, 2017 15:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants