Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add bounds argument to geom_density() #4013

Merged
merged 9 commits into from Jul 26, 2022

Conversation

echasnovski
Copy link
Contributor

This is a PR for #3387 and serves as a place for discussion and feedback about possible change.

The goal is to add bounds argument to geom_density() and stat_density(), which will allow users to specify known constraints on data to obtain more "realistic" density estimation. Change to computed output of stats::density() is made using relatively simple "reflection method of boundary correction" roughly taken from "Simple boundary correction for kernel density estimation” by M.C. Jones, 1993 (you can look at free pdf-file of presentation based on this article, in particular, page 7).

Currently this PR serves two purposes:

  • Show that this is a completely non-breaking update. All previous behavior is explicitly preserved.
  • Allow everyone to "play" with it.

If agreed that this is a desirable change, at least the following should also be made (which I am happy to do):

  • Figure out when and how to notify user about "bad bounds". For example, when some or all data points are outside of bounds interval, or simply it doesn't represent a valid interval.
  • Add examples of bounds usage.

@echasnovski
Copy link
Contributor Author

echasnovski commented May 19, 2020

Examples of usage:

library(ggplot2)
library(tibble)

set.seed(101)

ggplot(tibble(x = runif(100)), aes(x)) +
  geom_density() +
  geom_density(bounds = c(0, 1), color = "blue") +
  stat_function(fun = dunif, color = "red")

ggplot(tibble(x = rexp(100)), aes(x)) +
  geom_density() +
  geom_density(bounds = c(0, Inf), color = "blue") +
  stat_function(fun = dexp, color = "red")

Created on 2020-05-19 by the reprex package (v0.3.0)

@thomasp85
Copy link
Member

@clauswilke do you have time to take a look at this - I don't have any problems with the functionality but I'm quite rusty on density estimation code

@thomasp85 thomasp85 added this to the ggplot2 3.4.0 milestone Mar 25, 2021
@thomasp85
Copy link
Member

This is not pressing btw - it will not be part of the patch release

@clauswilke clauswilke self-assigned this Mar 25, 2021
@clauswilke
Copy link
Member

Sure, assigned to myself.

@echasnovski I will review within the next few weeks or so. If I don't, please ping me to remind me.

@echasnovski
Copy link
Contributor Author

echasnovski commented Apr 20, 2021

ping @clauswilke

@clauswilke
Copy link
Member

Apologies, please ping again after May 10. I have two more weeks of teaching and it's using up nearly all my time.

@echasnovski
Copy link
Contributor Author

Sure, no worries :)

@echasnovski
Copy link
Contributor Author

@clauswilke, how about ping number 2?

Copy link
Member

@clauswilke clauswilke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally looks reasonable. But please add some comments to document the logic, and also explain the logic in the exported documentation.

R/stat-density.r Outdated
@@ -122,8 +129,15 @@ compute_density <- function(x, w, from, to, bw = "nrd0", adjust = 1,
), n = 1))
}

dens <- stats::density(x, weights = w, bw = bw, adjust = adjust,
kernel = kernel, n = n, from = from, to = to)
if (all(is.infinite(bounds))) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add some comments here explaining the thought process.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

R/stat-density.r Show resolved Hide resolved
R/stat-density.r Outdated Show resolved Hide resolved
R/stat-density.r Outdated
@@ -15,6 +15,8 @@
#' not line-up, and hence you won't be able to stack density values.
#' This parameter only matters if you are displaying multiple densities in
#' one plot or if you are manually adjusting the scale limits.
#' @param bounds Known lower and upper bounds for estimated data. Default
#' `c(-Inf, Inf)` means that there are no (finite) bounds.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There should be some explanation somewhere in the documentation of what happens when the bounds are finite. And what happens if data points fall outside the set bounds?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was more of an open question to package authors. What do you think is the best behavior here? It mostly depends on the implied severity of bounds argument.

Currently no check is done. It estimates density and then tries to "reflect" it at edges. There are some extreme cases (like when bounds and data range not intersect at all), which I decided to tackle after the decision about bounds severity.

I can suggest at least three options:

  • Give a warning when at least one point is outside of bounds. Proceed as nothing happened. Implies that bounds is essentially for visualization purpose.
  • Remove points outside of bounds before or after estimating density. This implies that bounds is a data generation assumption and tries to fix it.
  • Give an error if at least one point is outside. This also implies data generation but stops early. I personally like this the most.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Documented about what is essentially done if any of bounds is finite.

The severity of bounds effect is still a question to you.

@clauswilke
Copy link
Member

With this PR, would the bounds argument generally be preferred over from and to? I notice that from and to are not documented, so that's maybe a good thing.

@echasnovski
Copy link
Contributor Author

With this PR, would the bounds argument generally be preferred over from and to? I notice that from and to are not documented, so that's maybe a good thing.

As I understand it, from and to are not arguments here. They are computed based on data range (here) and result into computed density being clipped at those edges.

Proposed bounds and pair of computed from and to result into two completely different behaviors. While from and to don't affect estimated density (they simply ignore outer tails which result into plot not having total square of 1), bounds affects estimated density by "reflecting" those tails inside bounded segment.

@thomasp85
Copy link
Member

@clauswilke are you interested in finishing this off with @echasnovski ?

@echasnovski
Copy link
Contributor Author

I have already forgotten the details of it :(
But if you (plural) think it is a worthwhile addition to {ggplot2}, I'll try to put some time into it.

@clauswilke
Copy link
Member

Same here. Forgotten the details, but I think it's reasonable to revive.

@echasnovski
Copy link
Contributor Author

I addressed previous comments. All other work seems to be better done after consensus about how severe bounds argument should be (see comments).

@thomasp85
Copy link
Member

@clauswilke I'll let you decide on the final points for this PR

@clauswilke
Copy link
Member

@echasnovski Sorry, I missed your update from May 12. Please always make sure to tag me, as I don't currently have the time to read every ggplot2 notification that comes my way but I try to pay attention to cases where I'm tagged.

Regarding bounds, I'm not generally in favor of any approach that errors out when it doesn't have to. Sometimes users want to do weird stuff for valid reasons, and we shouldn't prevent them from doing so. I think issuing a warning but otherwise performing a reasonable action is the way to go. So maybe remove points outside of bounds and warn that this has happened? This would be similar to the treatment of limits in scales functions.

@echasnovski
Copy link
Contributor Author

Hi, @clauswilke!

I updated this to remove data points outside of bounds with a warning. Also added example of bounds usage. What else should be done here?

I imagine, some tests should be done, but I didn't find any present substantial tests for either geom_density() or stat_density().

@clauswilke
Copy link
Member

@echasnovski Yes, some tests would be good. Just because there is no test currently for stat_density() doesn't mean we shouldn't add one. It would actually be a good idea to test the full density estimation. Create a small data set, run it through ggplot2, extract the data with layer_data(), then also perform the density estimation manually in the test script, and compare that the two are the same.

Also test that no adjustments are made when no boundaries are set.

You probably also have to merge in the current main branch, specifically to get the tests working again. There is a failure currently that I don't think is caused by your code.

@thomasp85 I'll be traveling through the remainder of July, with limited internet access. I may not be able to respond to developments in this PR. You're welcome to take over if necessary. I'll be back in August.

@echasnovski
Copy link
Contributor Author

echasnovski commented Jul 12, 2022

@echasnovski Yes, some tests would be good. Just because there is no test currently for stat_density() doesn't mean we shouldn't add one. It would actually be a good idea to test the full density estimation. Create a small data set, run it through ggplot2, extract the data with layer_data(), then also perform the density estimation manually in the test script, and compare that the two are the same.

@clauswilke, this showed to be not trivial. The reason is that ggplot() outputs data frame with grid range equal to data range, while stats::density() makes its own, extended range. I ended up testing functional approximation at some test sample.

Other than that, I added snapshot tests for different bounds and how stat_density() handles data outside of bounds. This feels like enough test cases while not being too restrictive.

Also I didn't manage to incorporate weights into density estimation. Here is what I got:

ggplot(mtcars, aes(mpg, weight = cyl)) + stat_density()
Warning message:
The following aesthetics were dropped during statistical transformation: weightThis can happen when ggplot fails to infer the correct grouping structure in the data.Did you forget to specify a `group` aesthetic or to convert a numerical variable into a factor? 

Also test that no adjustments are made when no boundaries are set.

Not entirely sure what you mean here. No boundary correction with default boundary is tested at equivalence of stat_density() and plain stats::density().

expect_snapshot(make_density(c(-Inf, Inf)))
expect_snapshot(make_density(c(mpg_min, Inf)))
expect_snapshot(make_density(c(-Inf, mpg_max)))
expect_snapshot(make_density(c(mpg_min, mpg_max)))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason that you're using snapshots here? Couldn't you just test numerically that density is approximately twice higher at finite boundary? It doesn't have to be that accurate. You could just test that it's > 1.9x and < 2.1x, for example.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@clauswilke, Snapshots seem to provide more full testing of the feature. This is not only about being twice as higher at finite boundary, but about a specific way of dealing with boundary correction (via density reflection). I can make snapshots smaller by setting smaller n.

All in all, I'd keep them. Will do as you ask, of course. So should I not use snapshots?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't like snapshots because if they break at some point in the future nobody will know what the correct result should have been. If you want to test the full reflection, then the right way to do this is to calculate the reflection again in the test and compare numerically. Yes, it may seem silly to run the same calculation in two locations and compare they're the same, but it actually is a good check for the correctness of the implementation in the main library.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated test to not use snapshots.

@clauswilke
Copy link
Member

I think your tests are good. I made some specific comments regarding one. I'm not a big fan of snapshot tests and believe they should be avoided whenever possible.

Regarding weights, did you check whether the computation was correct? I see no reason that it shouldn't. The warning occurs because I missed to add the following line in my recent PR, could you add it to stat_density()?

dropped_aes = "weight" # after statistical transformation, weights are no longer available

@echasnovski
Copy link
Contributor Author

Regarding weights, did you check whether the computation was correct? I see no reason that it shouldn't. The warning occurs because I missed to add the following line in my recent PR, could you add it to stat_density()?

Yep, you are right. weight is used and the warning is gone after adding dropped_aes = "weight". Should I keep it in this PR?

@clauswilke
Copy link
Member

Yes, please keep dropped_aes = "weight". I should have added it in my recent PR and I forgot. It makes sense to add it to yours, in particular if you add a test for weighted density estimations (but even if you don't, no strong opinion on my end).

@echasnovski
Copy link
Contributor Author

@clauswilke:

  • Updated test to not use snapshots.
  • Added dropped_aes and test for ability to do weighted density estimation.
  • Updated test for out of bounds data to respect weight.

@clauswilke
Copy link
Member

This looks good to me now. Only one last thing: Please add a news item to the top of: https://github.com/tidyverse/ggplot2/blob/main/NEWS.md
You can model it after the other ones there. We generally want to link the issue or PR that contributed the new feature and the author of the feature.

@echasnovski
Copy link
Contributor Author

This looks good to me now. Only one last thing: Please add a news item to the top of: https://github.com/tidyverse/ggplot2/blob/main/NEWS.md

@clauswilke, done.

@clauswilke
Copy link
Member

@thomasp85 This looks good to me. Do you want to give it a quick look also before we merge?

Copy link
Member

@thomasp85 thomasp85 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@clauswilke clauswilke merged commit f6b3833 into tidyverse:main Jul 26, 2022
@clauswilke
Copy link
Member

@echasnovski Thanks!

@echasnovski
Copy link
Contributor Author

@clauswilke, @thomasp85, thank you. The pleasure is mine to finally give back to the R package for visualization.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants