Add `bounds` argument to `geom_density()` #4013

echasnovski · 2020-05-19T16:30:35Z

This is a PR for #3387 and serves as a place for discussion and feedback about the proposed change.

The goal is to add a bounds argument to geom_density() and stat_density(), which will allow users to specify known constraints on the data and obtain a more "realistic" density estimate. The output of stats::density() is adjusted using the relatively simple "reflection method of boundary correction", roughly taken from "Simple boundary correction for kernel density estimation" by M.C. Jones, 1993 (you can look at the free PDF of a presentation based on this article, in particular page 7).
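To make the reflection idea concrete, here is a minimal base R sketch. It is not the code from this PR: the helper name `reflect_density_sketch()` and its arguments are hypothetical, and it only assumes that the unbounded estimate from `stats::density()` is interpolated and that the tails falling outside each finite bound are mirrored back inside.

```r
# Hypothetical helper: reflect an unbounded kernel density estimate at `bounds`.
reflect_density_sketch <- function(x, bounds = c(-Inf, Inf), n = 512) {
  # Unbounded estimate; the default `cut = 3` extends the grid past the data,
  # so the tails that leak outside `bounds` are available for reflection.
  dens <- stats::density(x, n = 2048)
  f <- stats::approxfun(dens$x, dens$y, yleft = 0, yright = 0)

  # Output grid: clipped to finite bounds, otherwise the full estimate range.
  left <- max(min(dens$x), bounds[1])
  right <- min(max(dens$x), bounds[2])
  grid <- seq(left, right, length.out = n)

  # Fold the mass lying outside each finite bound back inside.
  y <- f(grid)
  if (is.finite(bounds[1])) y <- y + f(2 * bounds[1] - grid)
  if (is.finite(bounds[2])) y <- y + f(2 * bounds[2] - grid)
  data.frame(x = grid, y = y)
}

set.seed(101)
est <- reflect_density_sketch(runif(200), bounds = c(0, 1))
# Trapezoidal area under the corrected curve; it should be close to 1.
area <- sum(diff(est$x) * (head(est$y, -1) + tail(est$y, -1)) / 2)
```

Because the mirrored tails are added rather than discarded, the corrected curve should still integrate to roughly 1 over the bounded interval.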

Currently this PR serves two purposes:

- Show that this is a completely non-breaking update. All previous behavior is explicitly preserved.
- Allow everyone to "play" with it.

If we agree that this is a desirable change, at least the following should also be done (which I am happy to do):

- Figure out when and how to notify the user about "bad bounds", for example when some or all data points fall outside the bounds interval, or when bounds simply doesn't represent a valid interval.
- Add examples of bounds usage.

echasnovski · 2020-05-19T16:32:29Z

Examples of usage:

library(ggplot2)
library(tibble)

set.seed(101)

ggplot(tibble(x = runif(100)), aes(x)) +
  geom_density() +
  geom_density(bounds = c(0, 1), color = "blue") +
  stat_function(fun = dunif, color = "red")

ggplot(tibble(x = rexp(100)), aes(x)) +
  geom_density() +
  geom_density(bounds = c(0, Inf), color = "blue") +
  stat_function(fun = dexp, color = "red")

Created on 2020-05-19 by the reprex package (v0.3.0)

thomasp85 · 2021-03-25T13:46:04Z

@clauswilke do you have time to take a look at this - I don't have any problems with the functionality but I'm quite rusty on density estimation code

thomasp85 · 2021-03-25T13:46:33Z

This is not pressing btw - it will not be part of the patch release

clauswilke · 2021-03-25T17:22:05Z

Sure, assigned to myself.

@echasnovski I will review within the next few weeks or so. If I don't, please ping me to remind me.

echasnovski · 2021-04-20T13:54:55Z

ping @clauswilke

clauswilke · 2021-04-20T16:08:21Z

Apologies, please ping again after May 10. I have two more weeks of teaching and it's using up nearly all my time.

echasnovski · 2021-04-20T18:27:15Z

Sure, no worries :)

echasnovski · 2021-06-28T14:09:58Z

@clauswilke, how about ping number 2?

clauswilke

Generally looks reasonable. But please add some comments to document the logic, and also explain the logic in the exported documentation.

clauswilke · 2021-06-28T16:20:13Z

R/stat-density.r

@@ -122,8 +129,15 @@ compute_density <- function(x, w, from, to, bw = "nrd0", adjust = 1,
    ), n = 1))
  }

-  dens <- stats::density(x, weights = w, bw = bw, adjust = adjust,
-    kernel = kernel, n = n, from = from, to = to)
+  if (all(is.infinite(bounds))) {


Please add some comments here explaining the thought process.


clauswilke · 2021-06-28T16:21:35Z

R/stat-density.r

@@ -15,6 +15,8 @@
 #'   not line-up, and hence you won't be able to stack density values.
 #'   This parameter only matters if you are displaying multiple densities in
 #'   one plot or if you are manually adjusting the scale limits.
+#' @param bounds Known lower and upper bounds for estimated data. Default
+#'   `c(-Inf, Inf)` means that there are no (finite) bounds.


There should be some explanation somewhere in the documentation of what happens when the bounds are finite. And what happens if data points fall outside the set bounds?

This was more of an open question to the package authors. What do you think is the best behavior here? It mostly depends on the implied severity of the bounds argument.

Currently no check is done. It estimates the density and then tries to "reflect" it at the edges. There are some extreme cases (such as when the bounds and the data range do not intersect at all), which I decided to tackle after the decision about bounds severity.

I can suggest at least three options:

1. Give a warning when at least one point is outside of bounds and proceed as if nothing happened. This implies that bounds is essentially for visualization purposes.
2. Remove points outside of bounds before or after estimating the density. This implies that bounds is an assumption about how the data were generated, and tries to enforce it.
3. Give an error if at least one point is outside. This also implies a data generation assumption but stops early. I personally like this the most.

I documented what is essentially done if any of the bounds is finite.

The severity of the bounds effect is still a question for you.

clauswilke · 2021-06-29T14:57:20Z

With this PR, would the bounds argument generally be preferred over from and to? I notice that from and to are not documented, so that's maybe a good thing.

echasnovski · 2021-06-29T17:55:12Z

> With this PR, would the bounds argument generally be preferred over from and to? I notice that from and to are not documented, so that's maybe a good thing.

As I understand it, from and to are not arguments here. They are computed from the data range (here) and result in the computed density being clipped at those edges.

The proposed bounds and the pair of computed from and to result in two completely different behaviors. While from and to don't affect the estimated density (they simply ignore the outer tails, which results in the plot not having a total area of 1), bounds affects the estimated density by "reflecting" those tails inside the bounded segment.
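The difference between clipping and reflecting can be checked numerically. The sketch below uses base R only and is an illustration under assumptions, not the PR's code: the `trapz()` helper is hypothetical, and the reflection at 0 is done by evaluating the estimate on a grid symmetric around 0 and adding its mirror image.

```r
set.seed(101)
x <- rexp(200)
bw <- stats::bw.nrd0(x)

# Hypothetical helper: trapezoidal rule for the area under a curve.
trapz <- function(x, y) sum(diff(x) * (head(y, -1) + tail(y, -1)) / 2)

# Clipping: evaluate the unbounded estimate on the data range only.
clipped <- stats::density(x, bw = bw, from = min(x), to = max(x), n = 512)
area_clipped <- trapz(clipped$x, clipped$y)

# Reflection at 0: evaluate on a grid symmetric around 0, so that rev(y)
# is the estimate at -x, then keep only the non-negative half.
half_width <- max(x) + 3 * bw
full <- stats::density(x, bw = bw, from = -half_width, to = half_width, n = 1025)
inside <- full$x >= 0
area_reflected <- trapz(full$x[inside], (full$y + rev(full$y))[inside])
```

For exponential data, a noticeable share of the kernel mass leaks below 0, so `area_clipped` should fall visibly short of 1 while `area_reflected` should stay close to 1.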

thomasp85 · 2022-05-11T11:25:26Z

@clauswilke are you interested in finishing this off with @echasnovski ?

echasnovski · 2022-05-11T11:34:01Z

I have already forgotten the details of it :(
But if you (plural) think it is a worthwhile addition to {ggplot2}, I'll try to put some time into it.

clauswilke · 2022-05-11T16:33:25Z

Same here. Forgotten the details, but I think it's reasonable to revive.

echasnovski · 2022-05-12T15:42:25Z

I addressed the previous comments. The remaining work seems better done after we reach consensus about how severe the bounds argument should be (see comments).

thomasp85 · 2022-07-06T07:55:38Z

@clauswilke I'll let you decide on the final points for this PR

clauswilke · 2022-07-07T00:08:35Z

@echasnovski Sorry, I missed your update from May 12. Please always make sure to tag me, as I don't currently have the time to read every ggplot2 notification that comes my way but I try to pay attention to cases where I'm tagged.

Regarding bounds, I'm not generally in favor of any approach that errors out when it doesn't have to. Sometimes users want to do weird stuff for valid reasons, and we shouldn't prevent them from doing so. I think issuing a warning but otherwise performing a reasonable action is the way to go. So maybe remove points outside of bounds and warn that this has happened? This would be similar to the treatment of limits in scales functions.
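A minimal sketch of the "remove and warn" behavior suggested here, in base R. The function name, the warning text, and the renormalization of weights are all assumptions made for illustration; the PR itself may differ (for instance, the commit list mentions cli::cli_warn() rather than warning()).

```r
# Hypothetical helper: drop points outside `bounds` and warn about it.
drop_outside_bounds <- function(x, w = NULL, bounds = c(-Inf, Inf)) {
  keep <- x >= bounds[1] & x <= bounds[2]
  if (!all(keep)) {
    warning(
      sprintf("Removed %d data point(s) outside of `bounds`.", sum(!keep)),
      call. = FALSE
    )
    x <- x[keep]
    if (!is.null(w)) {
      # Assumption: remaining weights are rescaled to sum to 1 again.
      w <- w[keep] / sum(w[keep])
    }
  }
  list(x = x, w = w)
}

# Usage: two of the four points fall outside [0, 1].
res <- suppressWarnings(
  drop_outside_bounds(c(-1, 0.2, 0.8, 2), w = rep(0.25, 4), bounds = c(0, 1))
)
```

Warning instead of erroring keeps the plot usable for users who set bounds deliberately tighter than their data.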

echasnovski · 2022-07-12T13:52:14Z

Hi, @clauswilke!

I updated this to remove data points outside of bounds with a warning. I also added an example of bounds usage. What else should be done here?

I imagine some tests should be added, but I didn't find any substantial existing tests for either geom_density() or stat_density().

clauswilke · 2022-07-12T15:02:09Z

@echasnovski Yes, some tests would be good. Just because there is no test currently for stat_density() doesn't mean we shouldn't add one. It would actually be a good idea to test the full density estimation. Create a small data set, run it through ggplot2, extract the data with layer_data(), then also perform the density estimation manually in the test script, and compare that the two are the same.

Also test that no adjustments are made when no boundaries are set.

You probably also have to merge in the current main branch, specifically to get the tests working again. There is a failure currently that I don't think is caused by your code.

@thomasp85 I'll be traveling through the remainder of July, with limited internet access. I may not be able to respond to developments in this PR. You're welcome to take over if necessary. I'll be back in August.

echasnovski · 2022-07-12T16:15:41Z

> @echasnovski Yes, some tests would be good. Just because there is no test currently for stat_density() doesn't mean we shouldn't add one. It would actually be a good idea to test the full density estimation. Create a small data set, run it through ggplot2, extract the data with layer_data(), then also perform the density estimation manually in the test script, and compare that the two are the same.

@clauswilke, this turned out to be nontrivial. The reason is that ggplot() outputs a data frame with a grid range equal to the data range, while stats::density() makes its own, extended range. I ended up comparing interpolated (functional) approximations of the two estimates on a test sample.

Other than that, I added snapshot tests for different bounds and for how stat_density() handles data outside of bounds. This feels like enough test cases while not being too restrictive.

Also I didn't manage to incorporate weights into density estimation. Here is what I got:

ggplot(mtcars, aes(mpg, weight = cyl)) + stat_density()
Warning message:
The following aesthetics were dropped during statistical transformation: weight
ℹ This can happen when ggplot fails to infer the correct grouping structure in the data.
ℹ Did you forget to specify a `group` aesthetic or to convert a numerical variable into a factor?

> Also test that no adjustments are made when no boundaries are set.

I'm not entirely sure what you mean here. The absence of boundary correction with the default bounds is tested via the equivalence of stat_density() and plain stats::density().
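The grid mismatch mentioned in this comment, and the workaround of comparing interpolated estimates, can be illustrated in base R. This is a sketch under assumptions: the call with from and to set to the data range stands in for the layer data that ggplot2 returns, and the probe points and tolerance are arbitrary choices.

```r
x <- mtcars$mpg
bw <- stats::bw.nrd0(x)

# Grid limited to the data range, as described for the ggplot2 output.
on_range <- stats::density(x, bw = bw, from = min(x), to = max(x), n = 512)
# Default call: the grid extends `cut * bw` beyond the data on both sides.
extended <- stats::density(x, bw = bw)

# The two grids differ, so compare interpolating functions at shared points.
f_range <- stats::approxfun(on_range$x, on_range$y)
f_extended <- stats::approxfun(extended$x, extended$y)
probe <- seq(min(x) + 0.1, max(x) - 0.1, length.out = 50)
max_abs_diff <- max(abs(f_range(probe) - f_extended(probe)))
```

A test along these lines would assert that `max_abs_diff` stays below a small tolerance rather than comparing the two outputs row by row.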

clauswilke · 2022-07-12T16:33:50Z

tests/testthat/test-stat-density.R

+  expect_snapshot(make_density(c(-Inf, Inf)))
+  expect_snapshot(make_density(c(mpg_min, Inf)))
+  expect_snapshot(make_density(c(-Inf, mpg_max)))
+  expect_snapshot(make_density(c(mpg_min, mpg_max)))


Is there a reason that you're using snapshots here? Couldn't you just test numerically that the density is approximately twice as high at a finite boundary? It doesn't have to be that accurate. You could just test that it's > 1.9x and < 2.1x, for example.

@clauswilke, snapshots seem to provide fuller testing of the feature. This is not only about the density being twice as high at a finite boundary, but about a specific way of dealing with boundary correction (via density reflection). I can make the snapshots smaller by setting a smaller n.

All in all, I'd keep them. I will do as you ask, of course. So should I not use snapshots?

I don't like snapshots because if they break at some point in the future, nobody will know what the correct result should have been. If you want to test the full reflection, then the right way to do this is to calculate the reflection again in the test and compare numerically. Yes, it may seem silly to run the same calculation in two locations and check that they're the same, but it actually is a good check for the correctness of the implementation in the main library.

Updated the test to not use snapshots.
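The numeric check discussed in this thread could look roughly like the following base R sketch. It does not call ggplot2: the reflected value is rebuilt from stats::density() output, and the reference is the Gaussian kernel sum evaluated directly at the bound. The variable names and the sample are assumptions for illustration.

```r
set.seed(42)
x <- rexp(500)
bw <- stats::bw.nrd0(x)
lower <- 0

# Reflected estimate at the bound, rebuilt from the unbounded estimate:
# the value at `lower` plus its mirror image about `lower`.
dens <- stats::density(x, bw = bw, n = 2048)
f <- stats::approxfun(dens$x, dens$y, yleft = 0, yright = 0)
reflected_at_bound <- f(lower) + f(2 * lower - lower)

# Independent reference: Gaussian kernel density evaluated directly at the bound.
plain_at_bound <- mean(stats::dnorm(lower, mean = x, sd = bw))

ratio <- reflected_at_bound / plain_at_bound
```

With reflection, `ratio` should land within the 1.9 to 2.1 window proposed above, and unlike a snapshot, the expected value is obvious to a future reader.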

clauswilke · 2022-07-12T16:37:28Z

I think your tests are good. I made some specific comments regarding one. I'm not a big fan of snapshot tests and believe they should be avoided whenever possible.

Regarding weights, did you check whether the computation was correct? I see no reason why it shouldn't be. The warning occurs because I forgot to add the following line in my recent PR. Could you add it to stat_density()?

ggplot2/R/stat-bin.r

Line 177 in a979ffd

    
           dropped_aes = "weight" # after statistical transformation, weights are no longer available

echasnovski · 2022-07-12T16:46:34Z

> Regarding weights, did you check whether the computation was correct? I see no reason why it shouldn't be. The warning occurs because I forgot to add the following line in my recent PR. Could you add it to stat_density()?

Yep, you are right. weight is used and the warning is gone after adding dropped_aes = "weight". Should I keep it in this PR?

clauswilke · 2022-07-12T16:48:48Z

Yes, please keep dropped_aes = "weight". I should have added it in my recent PR and I forgot. It makes sense to add it to yours, in particular if you add a test for weighted density estimations (but even if you don't, no strong opinion on my end).
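A test for weighted density estimation could compare against a direct weighted kernel sum. The base R sketch below is an assumption about how such a check might look, not the test added in this PR: weights are normalized to sum to 1 (which stats::density() expects), and a fixed numeric bandwidth is passed so that bandwidth selection does not depend on the weights.

```r
x <- mtcars$mpg
w <- mtcars$cyl / sum(mtcars$cyl)  # weights must sum to 1 for a true density
bw <- stats::bw.nrd0(x)            # fixed numeric bandwidth

weighted <- stats::density(x, weights = w, bw = bw, from = min(x), to = max(x))
unweighted <- stats::density(x, bw = bw, from = min(x), to = max(x))

# Reference: weighted sum of Gaussian kernels evaluated on the same grid.
reference <- vapply(
  weighted$x,
  function(t) sum(w * stats::dnorm(t, mean = x, sd = bw)),
  numeric(1)
)
max_err <- max(abs(weighted$y - reference))
weights_matter <- max(abs(weighted$y - unweighted$y)) > 0
```

Checking both that the weighted curve matches the reference and that it differs from the unweighted curve guards against weights being silently ignored.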

echasnovski · 2022-07-12T17:22:39Z

@clauswilke:

- Updated the test to not use snapshots.
- Added dropped_aes and a test for the ability to do weighted density estimation.
- Updated the test for out-of-bounds data to respect weight.

clauswilke · 2022-07-12T18:08:03Z

This looks good to me now. Only one last thing: Please add a news item to the top of: https://github.com/tidyverse/ggplot2/blob/main/NEWS.md
You can model it after the other ones there. We generally want to link the issue or PR that contributed the new feature and the author of the feature.

echasnovski · 2022-07-12T18:30:49Z

> This looks good to me now. Only one last thing: Please add a news item to the top of: https://github.com/tidyverse/ggplot2/blob/main/NEWS.md

@clauswilke, done.

clauswilke · 2022-07-12T19:04:26Z

@thomasp85 This looks good to me. Do you want to give it a quick look also before we merge?

thomasp85

LGTM

clauswilke · 2022-07-26T20:52:20Z

@echasnovski Thanks!

echasnovski · 2022-07-27T06:32:34Z

@clauswilke, @thomasp85, thank you. The pleasure is mine to finally give back to the R package for visualization.

Draft update StatDensity to have bounds argument.

c42ca9d

echasnovski mentioned this pull request May 19, 2020

geom_density() for bounded data #3387

Closed

thomasp85 added this to the ggplot2 3.4.0 milestone Mar 25, 2021

clauswilke self-assigned this Mar 25, 2021

clauswilke reviewed Jun 28, 2021


Refresh bounds usage in StatDensity.

9c5eeef

echasnovski force-pushed the geom_density-bounds branch from 8b9734d to 9c5eeef Compare May 12, 2022 15:38

echasnovski added 2 commits July 12, 2022 16:28

Handle data points outside of bounds.

9bd40d4

Update documentation with bounds example.

dc6f406

Merge branch 'main' into geom_density-bounds

e8c4196

echasnovski added 2 commits July 12, 2022 18:03

Use cli::cli_warn() instead of warn().

71900e0

Add tests for stat_density().

5a625f7

echasnovski force-pushed the geom_density-bounds branch from d239de8 to 5a625f7 Compare July 12, 2022 16:11

clauswilke reviewed Jul 12, 2022


Update tests for stat_density().

57229b3

Update 'NEWS.md'.

5173f97

clauswilke approved these changes Jul 12, 2022


thomasp85 approved these changes Jul 26, 2022


clauswilke merged commit f6b3833 into tidyverse:main Jul 26, 2022

jashapiro mentioned this pull request Oct 19, 2023

Cell type report: Update CellAssign ridgeplot section AlexsLemonade/scpca-nf#518

Closed
