Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

geom_density() for bounded data #3387

Closed
echasnovski opened this issue Jul 2, 2019 · 8 comments
Closed

geom_density() for bounded data #3387

echasnovski opened this issue Jul 2, 2019 · 8 comments
Labels
feature a feature request or enhancement layers 📈

Comments

@echasnovski
Copy link
Contributor

Currently geom_density() directly uses density() to compute kernel density estimation. Due to its nature, it has "extending property": output density can imply that values outside the range of input data are possible (with default "gaussian" kernel). This is a realistic practical setup, but there are cases when this is not true and data is bounded (for example, when only positive values are possible). It can be a good idea to support kernel density estimation on this type of bounded data with new bounds argument of geom_density().

In my opinion, one of the methods most easiest to implement, understand, and teach, is "reflection" method. There is one possible description on this page. Basically, boundary correction is done by doing "standard" kernel density estimation first, and then "reflecting" tails outside of desired interval to be inside. Densities inside and outside of desired interval are added together in "symmetric fashion": d(x) = d_f(x) + d_f(l - (x-l)) + d_f(r + (r-x)), where d_f is density of input, d is density of output, l and r are left and right edges of desired interval.

I made some quick and dirty changes to ggplot2 for demonstration. stat_density() gets bounds argument with default value of c(-Inf, Inf). Here are some examples of proposed functionality:

library(tibble)

set.seed(101)

ggplot(tibble(x = runif(100)), aes(x)) +
  geom_density() +
  geom_density(bounds = c(0, 1), color = "blue", ) +
  stat_function(data = tibble(x = c(0, 1)), fun = dunif, color = "red")

ggplot(tibble(x = rexp(100)), aes(x)) +
  geom_density() +
  geom_density(bounds = c(0, Inf), color = "blue") +
  stat_function(data = tibble(x = c(0, 5)), fun = dexp, color = "red")

@echasnovski
Copy link
Contributor Author

Here are more appropriate plots with plot limits completely representing current behavior:

library(ggplot2)

set.seed(101)

ggplot(data.frame(x = runif(100)), aes(x)) +
  geom_density() +
  geom_density(color = "blue", bounds = c(0, 1)) +
  stat_function(data = data.frame(x = c(0, 1)), fun = dunif, color = "red")

ggplot(data.frame(x = rexp(100)), aes(x)) +
  geom_density() +
  geom_density(color = "blue", bounds = c(0, Inf)) +
  stat_function(data = data.frame(x = c(0, 5)), fun = dexp, color = "red")

Created on 2019-07-03 by the reprex package (v0.3.0)

@MatthieuStigler
Copy link

Sorry, a naive question: if geom_density is using stats::density(), why not simply allow to passing arguments from and to in stats::density()?

@echasnovski
Copy link
Contributor Author

Not a naive question at all!

Arguments from and to represent "the left and right-most points of the grid at which the density is to be estimated;...". They don't alter the density estimation process, while this suggestion modifies output so that it describes density better in case there are known constraints on data generating process.

Here is an example. Black line demonstrates "raw" estimation, that goes outside of [0; 1] segment (which, we know should never happen). Blue line is a result with from = 0 and to = 1, and it just clips the portion of original curve (and as result is not a density estimate itself, because total square is less than 1).

library(ggplot2)

set.seed(42)
x <- runif(100)

den_default <- as.data.frame(density(x)[c("x", "y")])
den_cut <- as.data.frame(density(x, from = 0, to = 1)[c("x", "y")])

ggplot(mapping = aes(x, y)) +
  geom_line(data = den_cut, color = "blue", size = 2) +
  geom_line(data = den_default, color = "black")

Created on 2020-04-30 by the reprex package (v0.3.0)

@MatthieuStigler
Copy link

ok, thanks for the explanation!!

@spspitze
Copy link

spspitze commented May 17, 2020

Hello @echasnovski,

This proposed function looks really cool and useful. If you don't mind me asking, could you share your code that created it? Thanks.

Edit: Fantastic. Thank you so much.

@echasnovski
Copy link
Contributor Author

Thanks, @spspitze.

I decided to make PR (#4013) for this feature to make it more publicly accessible.
Currently, to actually use it, you have two choices:

  • Install version of 'ggplot2' from my fork (which you might "overwrite" when updating 'ggplot2' later on) via:
# install.packages("remotes")
remotes::install_github("echasnovski/ggplot2", ref="geom_density-bounds")
  • Somehow use the code from my fork directly. All relevant updates are in this file.

@nstjhp
Copy link

nstjhp commented Nov 25, 2020

Hi @echasnovski - thanks very much for this: I used your fork to good effect for a plot I was making, and also this issue made me aware of how the default clipped geom_densitys produced might not have area=1, which I find troubling.

Reading further into the subject, and for anyone else interested, I found the evmix package, and trying the examples for bckden it looks to be very good. Also their paper gives some more descriptions of the different options for dealing with boundary correction.

I suspect ultimately it will be necessary to do the computations for this outside of ggplot2 (perhaps one can use these pre-calculated densities with geom_density(stat="identity" ?), though of course much more convenient to have a bounds argument as you demonstrate!

@teunbrand
Copy link
Collaborator

This issue was fixed by #4013

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature a feature request or enhancement layers 📈
Projects
None yet
Development

No branches or pull requests

6 participants