Refactor visualization #178

Closed
echasnovski opened this Issue Aug 15, 2018 · 25 comments

@echasnovski (Collaborator) commented Aug 15, 2018

This is somewhat a continuation of #173. Even after DRYing visualize() in #177, I think it can be further refactored to be more straightforward. In my understanding, the visualization algorithm should be as follows:

  1. Check inputs for type correctness and reasonableness. This step shouldn't repair any inputs but only throw errors, warnings, and messages. It includes type checks and attention warnings.
  2. Initialize the plot. It can be done with either ggplot() or ggplot(data), depending on the final design (the second option is probably better).
  3. Add simulation layer:
    • If method = "simulation" then this is a geom_histogram() or geom_bar() of the stat column (the choice is based on the number of its unique elements).
    • If method = "theoretical" then no layer is added.
    • If method = "both" then this is a geom_histogram() of stat column and aes(y = ..density..).
  4. Add theoretical layer:
    • If method = "simulation" then no layer is added.
    • If method = "theoretical" then the output of stat_function() is added over the range between the [0.001, 0.999] quantiles of the theoretical distribution from attr(data, "theory_type").
    • If method = "both" then the output of stat_function() is added based on the range of the stat column (as here).
  5. Add external statistics. It seems that the current develop and master versions of the code are intended to prohibit simultaneous usage of the observed statistic obs_stat and the interval data endpoints. For now this has some flaws: supplying them both without shading gives the warning "Values for both endpoints and obs_stat were given when only one should be set. Ignoring obs_stat values." and then plots both objects with vertical lines. For me personally it was a surprise that I can't plot both of them, but I somewhat understand the motivation of not mixing p-values with confidence intervals. I strongly suggest changing their plotting in one of the following ways:
    1. Treat value and shading as a whole. Most declared features will be present. If both obs_stat and endpoints are present, give a warning and use only endpoints (no vertical line or shading is done for obs_stat).
    2. Create a new function to add external statistics to a ggplot object. It can be of the form add_external_stat(stat1, stat2 = NULL, line_col = "red2", direction = NULL, shadow_fill = "pink"). If one statistic is supplied, plot as currently done with the p-value; if two, as with endpoints. The biggest benefit of this approach is that visualize() will have far fewer arguments and be easier to use. The downside is the need to manually change the color if both p-value and CI data are plotted.
    3. Create two separate functions for adding p-value and CI data to the plot. I prefer this option as it is more targeted and explicit, allows comfortable defaults, and still lightens visualize()'s argument list. They can be add_pvalue(value, color = "red2", fill = "pink", direction = NULL) and add_ci(endpoints, color = "mediumaquamarine", fill = "turquoise").
  6. Add title and labels.
    • Title is constructed based on the method and type of theoretical distribution.
    • Y label is either "count" (for method = "simulation") or "density".
    • X label is stat for method = "simulation" and <short distr name> stat where <short distr name> is taken from this function.
@ismayc (Collaborator) commented Sep 3, 2018

@echasnovski Thanks for your patience on this.

I agree on all of 1-4. For 5, it was an oversight to allow both obs_stat and endpoints. That shouldn't be the case.

Agreed on 5i as well. Since add_external_stat() is a bit clunky and harder to understand, I think add_p_value() (with alias add_pvalue()) and add_ci() (with alias add_confidence_interval()) are cleaner and clearer.

Aligned on 6 as well.

@echasnovski (Collaborator, Author) commented Sep 4, 2018

In the list I meant 5i and the add_*() layer functions to be different options: the first means only changes in the internal logic of visualize(), while the second means moving that logic out into separate functions. Do I understand correctly that creating new layer functions is the preferred direction?

As this is a major update that breaks tests and vignettes, it will take some time. I plan to open a PR in a couple of weeks.

Please note, there will also probably be a need to update the ModernDive book.

@ismayc (Collaborator) commented Sep 4, 2018

Hmm, there was something nice about get_p_value() and get_confidence_interval() having roughly the same arguments as visualize(). Can you recap what the new layout would look like with some examples? I’m now not sure what arguments visualize() would take. There is no rush on any of this.

@echasnovski (Collaborator, Author) commented Sep 4, 2018

The general idea is to move all the logic of plotting (adding to the plot) p-value and interval regions into separate functions. That means the arguments that define a certain object go to the respective function:

  • visualize(data, bins = 15, method = "simulation", dens_color = "black"). This only plots histogram and/or density curve based on data.
  • add_p_value(obs_stat, direction = NULL, color = "red2", fill = "pink") (argument names and order are updated compared to my initial suggestion). By default it adds only a vertical line with x coordinate at obs_stat. If direction is not NULL then it also adds a shadow.
  • add_ci(endpoints, color = "mediumaquamarine", fill = "turquoise"). By default it adds two vertical lines with x coordinates at endpoints and shades the region between them. It can be made to plot only the two vertical lines if fill is NULL (to align more with add_p_value()).

Some examples (with tbl being data frame of appropriate structure):

  • visualize(tbl) will behave exactly the same.
  • visualize(tbl, obs_stat = 1) will be replaced with visualize(tbl) + add_p_value(1).
  • visualize(tbl, obs_stat = 1, direction = "right") will be replaced with visualize(tbl) + add_p_value(1, direction = "right").
  • visualize(tbl, endpoints = c(0, 1), direction = "between") will be replaced with visualize(tbl) + add_ci(c(0, 1)).
  • With this design there will be an option to add both p-value and interval regions: visualize(tbl) + add_p_value(1) + add_ci(c(0, 1))
@ismayc (Collaborator) commented Sep 4, 2018

Thanks much. So visualize() will only have the data argument and maybe method? Anything else as arguments there?

I think what you have proposed is a lot cleaner and more in line with the {infer} verbs and what {ggplot2} and {dplyr} use as well.

@echasnovski (Collaborator, Author) commented Sep 4, 2018

Yep: data, method, bins, and color (or dens_color if you wish it to apply only to the density curve).

As clean as it is, it is hugely breaking, and the deprecation strategy for the extra visualize() arguments should be discussed. Maybe keep them at first, powered by the new functions, and give a warning when they are used?

@ismayc (Collaborator) commented Sep 4, 2018

Definitely deprecate slowly with warnings

@echasnovski (Collaborator, Author) commented Sep 5, 2018

After some thinking, I have small doubts about the names add_p_value() and add_ci(). Here are the reasons:

  • The prefix add_* doesn't clearly resonate for me as being related to graphics. Moreover, the proposed functions are designed to be used with +, which introduces a subtle tautology.
  • What is really done in those functions isn't explicitly related to the p-value or confidence interval. They create a certain number of vertical lines and possibly fill certain region(s) of the plot. Moreover, their arguments don't reflect the intended relation: add_p_value()'s main argument is obs_stat and add_ci()'s is endpoints.

So I would like to discuss a more appropriate naming scheme for this pair of functions. It should be consistent and somehow relate to those statistical objects. My thoughts:

  • The prefix might be:
    • geom_*. Quite an obvious choice, but it might be confusing, as geom_*() functions tend to have a common set of arguments (data, mapping, etc.), which is not the case here.
    • layer_*. The reason why it is not the best choice is the same as for the previous one.
    • shadow_* or shade_*. I like these most, as they show both the intended result and the action.
  • The suffixes might be:
    • The initially proposed *_p_value and *_ci. These aren't clearly bad but have the problems I stated at the beginning.
    • *_tails and *_interval. Come to think of it, add_p_value() is intended to fill the left and/or right region of the plot, which might be named "tails" to resonate more clearly with statistics in general and the p-value in particular. Similarly, add_ci() is intended to fill some horizontal interval of the plot.

So to sum up, I think it might be better to rename add_p_value() to shadow_tails() and add_ci() to shadow_interval(). The side effect will be that adding just vertical lines without shading is no longer possible, but this can be done with geom_vline(), so it shouldn't introduce much trouble.

@ismayc, @andrewpbray, what do you think?

@ismayc (Collaborator) commented Sep 5, 2018

Given that Hadley has said he wouldn't have used + and would have gone with %>% instead if he had to do it all over again, I don't think going the + route makes sense here. The plotly package does use add_* for its functions with the pipe.

We are getting into some tricky water here, though, since we expect students of infer to have some experience with ggplot2 and dplyr (but not plotly) prior to using it. That's one reason why I decided to just include arguments in visualize().

@ismayc (Collaborator) commented Sep 5, 2018

I'm not opposed to shadow_*, but one important part of using the infer package is making sure students understand the p-value and what makes it up, as well as understanding confidence intervals. shadow_interval() is OK. Maybe display_p_value() and display_confidence_interval() with alias display_ci() would work OK?

@echasnovski (Collaborator, Author) commented Sep 5, 2018

The reason why I chose the new functions to be essentially {ggplot2} layers is exactly that they should be familiar to potential users of {infer}. Thus using them by adding separate layers (with +) seemed natural.

I really like the display_* prefix. However, I would argue that the current and proposed functionality for displaying the p-value and CI doesn't actually do that:

  • The p-value should be perceived as an area under the curve in the extreme region, which is hard to visualize when a histogram is present. Therefore, I believe, the approach of shading the extreme part(s) of the plot was chosen here. This actually might be misleading when learning this concept for the first time, as no p-value is actually displayed.
  • With the CI the story is a little different. The main argument endpoints isn't actually connected in any way to the CI: it is just two numbers that represent some interval. Therefore, renaming to *_interval makes more sense to me.

In order to teach that display_tails() and display_interval() are related to the p-value and CI respectively, we can mention it in help pages, vignettes, and the README.

@ismayc (Collaborator) commented Sep 6, 2018

A p-value is well approximated by the proportion of simulated statistics in the null distribution that are as extreme or more extreme than the observed statistic. I think students catch on to this pretty well, actually, and then it is pretty natural to extend to looking at the smooth approximation and the area under the curve as a p-value. I've had better success teaching students about the p-value using simulation-based methods in this fashion than I have going straight into the p-value definition you have used. A development of this is in Chapter 10 of ModernDive.

Hmm, now you have me thinking that shade_p_value() and shade_interval() are best. I'm also fine with shade_tails() as an alias of shade_p_value(), but there is some inherent beauty in having p_value in the actual function name, since its arguments require users to think about the definition of the p-value as they use it.

Since display could also mean "present", and one would then expect a number or two to be produced as well, I think it might not be the best choice. What do you think?

@echasnovski (Collaborator, Author) commented Sep 6, 2018

I like both display_* and shade_*. The first doesn't imply that color filling will be done (so it doesn't really contradict plotting only vertical lines), but the second does. You are right, though, about the possible misunderstanding that a number is produced.

I understand the motivation for including p_value in the function name but don't quite agree, based on naming consistency. A function name should explain what the function does as precisely and vividly as it can: it should be both "true" and memorable. In my opinion, using shade_p_value() sends the message that the output will have some region of the plot "shaded" that directly represents the p-value. For me, the current shading doesn't represent the p-value directly enough; rather, it gives intermediate information that is helpful for estimating it.

I think the following example might better illustrate my understanding of using p_value here. Consider a function usage geom_proportion(X). I would somewhat expect it to add information to the plot that I can immediately use to say "The proportion contained in column X is equal to P". Suppose instead it is implemented as geom_jitter(aes(x = X, y = 1)), i.e. it just plots points at certain levels of X. I can use it to illustrate what a proportion is and to compute its value in some particular case, but it is not what I would expect it to do.

@echasnovski (Collaborator, Author) commented Sep 6, 2018

Maybe instead of the *_tails suffix we can use *_extreme, i.e. shade_extreme()? As this word pops up very often in definition(s) of the p-value, it might be explicit enough to illustrate (but not actually "display") the concept on the plot.

@ismayc (Collaborator) commented Sep 7, 2018

So if we shaded under the curve or the histogram directly, you'd be OK with shade_p_value()? But since we are shading rectangles, you prefer shade_extreme()? I'm not sure I understand you wanting to use shade_interval() then, given that it also includes things outside the actual distributions. I still think a visual interpretation of the p-value is the best way to get students to understand it, with a clear mention of that in the function name. shade_interval() is fine, since students don't tend to struggle with confidence intervals the way they do with p-values. It's more than just students, though, as lots of scientists don't understand p-values either.

@echasnovski (Collaborator, Author) commented Sep 7, 2018

I think yes: if shading is done under the curve or on the histogram, then shade_p_value() is a good choice.

The name shade_interval() "says" that some interval is to be shaded, as it is: an interval on the x-axis defined by endpoints. It just so happens that it is recommended (via help pages, vignettes, and other sources) for showing the confidence interval of a statistic. Maybe the similarity of the terms ("interval" and "CI") is why both you and I are fine with that name.

@ismayc (Collaborator) commented Sep 7, 2018

Fair enough 😄. The tricky thing about shading the histogram is that some bins will have two different shadings, which is why we decided to just shade rectangles instead in #51. You can't see what Albert is referring to anymore, but if you fill a histogram based on a condition of being more extreme, you'll see what I mean.

@echasnovski (Collaborator, Author) commented Sep 7, 2018

Oh, I tried to create a solution that shades the distribution, but failed, mainly because with method = "both" it isn't clear what should be shaded: the histogram or the density curve. And manipulations with bin coloring are an issue as well.

I guess the inability to create a perfect algorithmic solution got me into trying to name the function properly and supplying recommendations for its usage. It will also be less breaking for the current solution.

@ismayc (Collaborator) commented Sep 7, 2018

shade_extreme() and shade_interval() are fine with me, but if instructors bring up confusion about students learning the p-value using {infer}, I'd also like shade_p_value() as an alias.

@echasnovski (Collaborator, Author) commented Sep 7, 2018

I am glad that I've managed to convince you 😊

I am worried, though, about sabotaging future teaching with this. Maybe you know some third-party educators who could explicitly help with this choice?

@ismayc (Collaborator) commented Sep 7, 2018

@rudeboybert @apreshill @peterbchi @LucyMcGowan You have been summoned to help us determine what to call a function. Your experience and guidance are much appreciated in advance. Beer on me next time I see you. Apologies for the amount of reading here but I think you’ll get the gist by just reading above this comment a little bit.

@rudeboybert (Contributor) commented Sep 8, 2018

Hello all! First off I appreciate the effort and thoughtfulness you are both putting into the discussion of "What do I call this function?" I feel like this thread should be submitted to "The Annals of 'Naming Things is Hard'" 😆

Before I answer, I'm going to paraphrase my understanding of the discussion so far. You have the following goals you would like to satisfy all at once:

  • Goal 1: Implement these visualization functions so that they are consistent with ggplot2 design principles. Note: the educator in me cares about this the least, but the tidyverse enthusiast in me cares about it a lot, as I generally prefer not reinventing any wheels and keeping things as consistent as possible with the tidyverse and now tidymodels ecosystems.
  • Goal 2: Name these visualization functions so that they are consistent with their true statistical definitions.
  • Goal 3: Minimize the number of layers of abstraction between "I want to visualize the p-value/CI" and "How do I do this in infer?" In particular with the newbie to statistics and/or computation in mind.

I say the shade_p_value() & shade_ci() options are the current front-runners. Here are my reasons:

  • While VERB_extreme() for p-values and VERB_interval() for CIs satisfies Goal 2 best given how visualize() works, I think this comes too much at the expense of Goal 3. At the very least the function names should have the goal in them: VERB_p_value() and VERB_ci() (or conf_int or wtv is consistent with tidymodels).
  • Agreed, I now favor scrapping add_*(). We're not adding a layer to a plot for the sake of adding a layer to a plot; rather, we want to "visualize" the p-value or CI. And avoiding any subtle tautologies about "adding an add layer" is a nice corollary 😄
  • As @echasnovski pointed out, display_*() is a strong verb suggesting "this is right." I think shade_*() is just ambiguous enough to not get Goal 2 wrong, but is still "actionable" enough to satisfy Goal 3.

Also, maybe consider shade_lower_ci() and shade_upper_ci() wrapper functions as well?

@echasnovski (Collaborator, Author) commented Sep 9, 2018

The goal here, as I see it, is to find a balance (through naming) between the perception of what the shade_*() functions will actually do and what they will be used to represent.

Here are the problems formulated more concretely in code. Maybe it will help.

P-value

For now {infer} has the following way of producing the plot:

set.seed(10001)
mtcars_tbl <- mtcars %>%
  dplyr::mutate(am = factor(am)) %>%
  specify(mpg ~ am) %>% # alt: response = mpg, explanatory = am
  hypothesize(null = "independence") %>%
  generate(reps = 100, type = "permute") %>%
  calculate(stat = "t", order = c("1", "0"))

visualize(mtcars_tbl, method = "both", obs_stat = 1.5, direction = "both")

[Example plot: example-plot-pval]

There is an idea to split the plotting of distribution(s) from the additional visual elements. We are struggling to choose between two versions. In both of them visualize() plots only the distributions.

  1. Using shade_p_value() to add a "layer" that shades the extreme regions used for a visual presentation of the p-value definition:

# `visualize()` plots only distributions
visualize(mtcars_tbl, method = "both") +
  # `shade_*()` adds shaded regions
  shade_p_value(obs_stat = 1.5, direction = "both")

  2. Using shade_extreme() for the same purpose:

visualize(mtcars_tbl, method = "both") +
  shade_extreme(obs_stat = 1.5, direction = "both")

All three examples (the current one and the two alternatives) produce the same final plot.

CI

The same with CI.

# Current plotting
visualize(
  mtcars_tbl, method = "both", endpoints = c(-1.5, 1.5), direction = "between"
)

[Example plot: example-plot-ci]

# With `shade_ci()`
visualize(mtcars_tbl, method = "both") +
  shade_ci(endpoints = c(-1.5, 1.5))

# With `shade_interval()`
visualize(mtcars_tbl, method = "both") +
  shade_interval(endpoints = c(-1.5, 1.5))

Again, all three examples will provide the same final plot.

@peterbchi commented Sep 12, 2018

@ismayc, thanks for including me in the discussion! I've read a bit of it and in particular really appreciated @rudeboybert's summary/questions. I haven't dug in deeply enough to form any opinions or suggestions of my own, though. Just wanted to let you know that I saw this and am very interested but don't have anything to contribute yet. I'll have more time on Friday to look again!

echasnovski added a commit to echasnovski/infer that referenced this issue Sep 13, 2018

Start refactoring of `visualize()` (tidymodels#178).
Also:
- Adds `shade_p_value()`, `shade_pvalue()`, `shade_confidence_interval()`, and `shade_ci()`.
@echasnovski (Collaborator, Author) commented Sep 24, 2018

Implemented in #198.
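For reference, usage of the shade_*() functions introduced by the commit above looks roughly like this (assuming the mtcars_tbl object prepared earlier in this thread; see #198 for the final signatures):

```r
# `mtcars_tbl` is the permutation distribution built earlier in the thread
visualize(mtcars_tbl, method = "both") +
  shade_p_value(obs_stat = 1.5, direction = "both")

visualize(mtcars_tbl, method = "both") +
  shade_confidence_interval(endpoints = c(-1.5, 1.5))
```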
