Introduce mcmc_parcoord #108

jgabry · 2017-09-05T17:18:31Z

(Resolves #107)

This PR introduces the mcmc_parcoord function for parallel coordinates plots of MCMC draws. It can also highlight divergences in the plot if NUTS parameter info is passed to the np argument. The resulting plot is basically the plot suggested by @ahartikainen:

http://discourse.mc-stan.org/t/concentration-of-divergences/1590/21

Example usage:

library(rstan)
fit <- stan_demo("eight_schools")
draws <- as.array(fit, pars = c("tau", "theta", "lp__"))
np <- nuts_params(fit)

# default (i.e., without showing divergences)
color_scheme_set("blue")
mcmc_parcoord(draws)

# including divergences
color_scheme_set("darkgray")
div_style <- parcoord_style_np(div_color = "green", div_size = 0.5, div_alpha = 0.1)
mcmc_parcoord(draws, size = 0.25, alpha = 0.1, np = np, np_style = div_style)

[ci skip]

tjmahr · 2017-09-05T19:38:21Z

NAMESPACE

@@ -55,6 +55,7 @@ export(mcmc_nuts_energy)
 export(mcmc_nuts_stepsize)
 export(mcmc_nuts_treedepth)
 export(mcmc_pairs)
+export(mcmc_parcoord)


mcmc_parcoord_data function too?

Yes good point. I'll add that. In the case that the user specifies the nuts
parameters, do you think it makes sense to return a single data frame that
has a column indicating if there was a divergence, or a list of two data
frames already split?

I guess that's also a general question for implementing the _data
functions. It would be nice to always return a single data frame for all of
those functions (like what you already did for ppc_intervals/ribbon) but I
wonder if we'll run into cases going forward when that's not the best
option. An alternative would be to always return a list of data frames and
sometimes (usually) that list will only have a single data frame. What do
you think? Whatever we choose it would be good to do it the same way for
all of them.

When possible, return just a dataframe. If a plot requires multiple data-frames, return them in a list. Definitely agree that we should try to be as consistent as possible.

Yes good point. I'll add that. In the case that the user specifies the nuts
parameters, do you think it makes sense to return a single data frame that
has a column indicating if there was a divergence, or a list of two data
frames already split?

With an indicator column.

jgabry · 2017-09-05T19:48:54Z

Yes good point. I'll add that. In the case that the user specifies the nuts parameters, do you think it makes sense to return a single data frame that has a column indicating if there was a divergence, or a list of two data frames already split? I guess that's also a general question for implementing the `_data` functions. It would be nice to always return a single data frame for all of those functions (like what you already did for ppc_intervals/ribbon) but I wonder if we'll run into cases going forward when that's not the best option. An alternative would be to always return a list of data frames and sometimes (usually) that list will only have a single data frame. What do you think? Whatever we choose it would be good to do it the same way for all of them.

…

On Tue, Sep 5, 2017 at 3:38 PM TJ Mahr ***@***.***> wrote: ***@***.**** commented on this pull request. ------------------------------ In NAMESPACE <#108 (comment)>: > @@ -55,6 +55,7 @@ export(mcmc_nuts_energy) export(mcmc_nuts_stepsize) export(mcmc_nuts_treedepth) export(mcmc_pairs) +export(mcmc_parcoord) mcmc_parcoord_data function too? — You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub <#108 (review)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AHb4Q4OhODe4yV3AUquSBbJUIVOfboZKks5sfaMugaJpZM4PNQyO> .

tjmahr · 2017-09-05T19:49:18Z

R/mcmc-parcoord.R

+
+    has_divs <- !is.null(np)
+    if (has_divs) {
+      np <- validate_nuts_data_frame(np)


Three points.

First, I recommend dplyr::pull(df, value) or getElement(df, "value) instead of .$value. I think the package check will be unhappy with the . symbol being an undefined global. If you use pull, then the namespace needs to be updated too.

Second, speaking of undefined globals, Travis has timed out during checks on this branch. Can you confirm a clean package check?

Third, here's how the function might look in the new tidy evaluation system, you can use sym() to create a symbol and then UQ() to unquote the variable containing the symbol. (Typically you would write !! x instead of UQ(x) but it's best to avoid that shortcut in logical expressions.) Learn more about it from these guides.

if (has_divs) { param <- sym("Parameter") value <- sym("Value") divg <- sym("Divergent") draws$Divergent <- np %>% filter(UQ(param) == "divergent__") %>% pull(UQ(value)) %>% rep(times = n_param) np <- validate_nuts_data_frame(np) div_draws <- filter(draws, UQ(divg) == 1) draws <- filter(draws, UQ(divg) == 0) }

This suggestion is not required. This is just a heads-up to check it out and try to get comfortable with it.

jgabry · 2017-09-05T22:12:57Z

@tjmahr Thanks for the comments. Here's an update:

I added mcmc_parcoord_data. It returns a data frame with columns Draw (integer), Parameter (factor), Value (numeric), Divergent (numeric 0/1). I was going to name the Draw column Iteration but I didn't because all chains have already been merged so Iteration doesn't really make sense. (For other _data functions in the future it could make sense to return both Iteration and Chain instead of just Draw if the plotting function needs to keep chains separate.)
In the latest commits I use the new tidy eval syntax. Thanks for the reminder. I ended up changing how I was combining the parameter draws and the divergences so it's a bit different than what you suggested but I think it's correct (I'm still getting used to the new syntax so if you can take another look that'd be great).
I just ran R CMD CHECK locally and it passes.

ahartikainen · 2017-09-06T11:02:58Z

You could make an option for normalizing the data to mean 0 and sd 1 ( (x-mu)/sigma). It is needed if the two parameters are in different scales (e.g. normal(0,100), normal(150, 0.01))

jgabry · 2017-09-06T14:47:35Z

I think this can be done using the existing transformations argument that all the mcmc_ plotting functions have. That is, mcmc_parcoord(x, np = np, transformations = function(x) (x - mean(x))/sd(x)) I should add an example to the Examples section of the doc. Alternatively, there could be an mcmc_parcoord_scaled function that does this automatically. Might be convenient.

…

On Wed, Sep 6, 2017 at 7:03 AM Ari Hartikainen ***@***.***> wrote: You could make an option for normalizing the data to mean 0 and sd 1 ( (x-mu)/sigma). It is needed if the two parameters are in different scales (e.g. normal(0,100), normal(150, 0.01)) — You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub <#108 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AHb4Q_RlDbtJ9bENLGwC9SKooU_IybxMks5sfnvjgaJpZM4PNQyO> .

[ci skip]

jgabry · 2017-09-06T21:40:20Z

@ahartikainen I added an example of standardizing in 3279ebe. Thanks for the suggestion.

tjmahr

mcmc_melt is a great idea. So an "Iteration" is an iteration in a sampling chain. And a "Draw" is sample.

jgabry · 2017-09-08T15:17:24Z

So an "Iteration" is an iteration in a sampling chain. And a "Draw" is sample.

Yeah, well at least that makes sense to me! If we have a sample size of 4000 from 4 stacked chains of 1000 then it seems strange to refer the the 1001st value as the 1001st iteration since it was actually the 1st iteration in one of the chains before the chains were stacked. So yeah I think it makes sense to just use Draw. We can use Iteration if there's also a Chain column.

jgabry added 5 commits September 3, 2017 19:59

Parallel coordinates plot, optionally with divergences

f956a4a

div_color arg for mcmc_trace

d753e1c

Chang class to nuts_style

10ae68a

More work on parcoord plot

6377633

[ci skip]

Add plot description section

479b557

[ci skip]

jgabry added feature new plot labels Sep 5, 2017

Add mcmc_parcoord example to diagnostics vignette

b1e47d5

[ci skip]

tjmahr reviewed Sep 5, 2017

View reviewed changes

tjmahr requested changes Sep 5, 2017

View reviewed changes

mcmc_parcoord_data + TJ's suggestions

66b5b6b

Use new melt_mcmc helper function

06ee061

Add example of using transformation

3279ebe

[ci skip]

jgabry mentioned this pull request Sep 7, 2017

Next CRAN release #109

Closed

7 tasks

Merge branch 'master' into mcmc_parcoord

f2f89bf

tjmahr reviewed Sep 7, 2017

View reviewed changes

tjmahr approved these changes Sep 7, 2017

View reviewed changes

jgabry merged commit 5095d68 into master Sep 8, 2017

jgabry deleted the mcmc_parcoord branch September 8, 2017 15:25

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Introduce mcmc_parcoord #108

Introduce mcmc_parcoord #108

jgabry commented Sep 5, 2017

tjmahr Sep 5, 2017

tjmahr Sep 5, 2017

tjmahr Sep 5, 2017

jgabry commented Sep 5, 2017 via email

tjmahr Sep 5, 2017 •

edited

Loading

jgabry commented Sep 5, 2017

ahartikainen commented Sep 6, 2017

jgabry commented Sep 6, 2017 via email

jgabry commented Sep 6, 2017

tjmahr left a comment

jgabry commented Sep 8, 2017

Introduce mcmc_parcoord #108

Introduce mcmc_parcoord #108

Conversation

jgabry commented Sep 5, 2017

Example usage:

tjmahr Sep 5, 2017

Choose a reason for hiding this comment

tjmahr Sep 5, 2017

Choose a reason for hiding this comment

tjmahr Sep 5, 2017

Choose a reason for hiding this comment

jgabry commented Sep 5, 2017 via email

tjmahr Sep 5, 2017 • edited Loading

Choose a reason for hiding this comment

jgabry commented Sep 5, 2017

ahartikainen commented Sep 6, 2017

jgabry commented Sep 6, 2017 via email

jgabry commented Sep 6, 2017

tjmahr left a comment

Choose a reason for hiding this comment

jgabry commented Sep 8, 2017

tjmahr Sep 5, 2017 •

edited

Loading