`geom_density()` for bounded data #3387

echasnovski · 2019-07-02T15:09:36Z

Currently geom_density() directly uses density() to compute kernel density estimation. Due to its nature, it has "extending property": output density can imply that values outside the range of input data are possible (with default "gaussian" kernel). This is a realistic practical setup, but there are cases when this is not true and data is bounded (for example, when only positive values are possible). It can be a good idea to support kernel density estimation on this type of bounded data with new bounds argument of geom_density().

In my opinion, one of the methods most easiest to implement, understand, and teach, is "reflection" method. There is one possible description on this page. Basically, boundary correction is done by doing "standard" kernel density estimation first, and then "reflecting" tails outside of desired interval to be inside. Densities inside and outside of desired interval are added together in "symmetric fashion": d(x) = d_f(x) + d_f(l - (x-l)) + d_f(r + (r-x)), where d_f is density of input, d is density of output, l and r are left and right edges of desired interval.

I made some quick and dirty changes to ggplot2 for demonstration. stat_density() gets bounds argument with default value of c(-Inf, Inf). Here are some examples of proposed functionality:

library(tibble)

set.seed(101)

ggplot(tibble(x = runif(100)), aes(x)) +
  geom_density() +
  geom_density(bounds = c(0, 1), color = "blue", ) +
  stat_function(data = tibble(x = c(0, 1)), fun = dunif, color = "red")

ggplot(tibble(x = rexp(100)), aes(x)) +
  geom_density() +
  geom_density(bounds = c(0, Inf), color = "blue") +
  stat_function(data = tibble(x = c(0, 5)), fun = dexp, color = "red")

The text was updated successfully, but these errors were encountered:

echasnovski · 2019-07-03T08:50:54Z

Here are more appropriate plots with plot limits completely representing current behavior:

library(ggplot2)

set.seed(101)

ggplot(data.frame(x = runif(100)), aes(x)) +
  geom_density() +
  geom_density(color = "blue", bounds = c(0, 1)) +
  stat_function(data = data.frame(x = c(0, 1)), fun = dunif, color = "red")

ggplot(data.frame(x = rexp(100)), aes(x)) +
  geom_density() +
  geom_density(color = "blue", bounds = c(0, Inf)) +
  stat_function(data = data.frame(x = c(0, 5)), fun = dexp, color = "red")

^{Created on 2019-07-03 by the reprex package (v0.3.0)}

MatthieuStigler · 2020-04-30T18:04:20Z

Sorry, a naive question: if geom_density is using stats::density(), why not simply allow to passing arguments from and to in stats::density()?

echasnovski · 2020-04-30T18:24:29Z

Not a naive question at all!

Arguments from and to represent "the left and right-most points of the grid at which the density is to be estimated;...". They don't alter the density estimation process, while this suggestion modifies output so that it describes density better in case there are known constraints on data generating process.

Here is an example. Black line demonstrates "raw" estimation, that goes outside of [0; 1] segment (which, we know should never happen). Blue line is a result with from = 0 and to = 1, and it just clips the portion of original curve (and as result is not a density estimate itself, because total square is less than 1).

library(ggplot2)

set.seed(42)
x <- runif(100)

den_default <- as.data.frame(density(x)[c("x", "y")])
den_cut <- as.data.frame(density(x, from = 0, to = 1)[c("x", "y")])

ggplot(mapping = aes(x, y)) +
  geom_line(data = den_cut, color = "blue", size = 2) +
  geom_line(data = den_default, color = "black")

^{Created on 2020-04-30 by the reprex package (v0.3.0)}

MatthieuStigler · 2020-04-30T20:38:36Z

ok, thanks for the explanation!!

spspitze · 2020-05-17T19:16:35Z

Hello @echasnovski,

This proposed function looks really cool and useful. If you don't mind me asking, could you share your code that created it? Thanks.

Edit: Fantastic. Thank you so much.

echasnovski · 2020-05-19T16:46:23Z

Thanks, @spspitze.

I decided to make PR (#4013) for this feature to make it more publicly accessible.
Currently, to actually use it, you have two choices:

Install version of 'ggplot2' from my fork (which you might "overwrite" when updating 'ggplot2' later on) via:

# install.packages("remotes")
remotes::install_github("echasnovski/ggplot2", ref="geom_density-bounds")

Somehow use the code from my fork directly. All relevant updates are in this file.

nstjhp · 2020-11-25T20:51:36Z

Hi @echasnovski - thanks very much for this: I used your fork to good effect for a plot I was making, and also this issue made me aware of how the default clipped geom_densitys produced might not have area=1, which I find troubling.

Reading further into the subject, and for anyone else interested, I found the evmix package, and trying the examples for bckden it looks to be very good. Also their paper gives some more descriptions of the different options for dealing with boundary correction.

I suspect ultimately it will be necessary to do the computations for this outside of ggplot2 (perhaps one can use these pre-calculated densities with geom_density(stat="identity" ?), though of course much more convenient to have a bounds argument as you demonstrate!

teunbrand · 2023-01-31T19:57:48Z

This issue was fixed by #4013

paleolimbot added feature a feature request or enhancement layers 📈 labels Jul 22, 2019

yutannihilation mentioned this issue Jan 9, 2020

Set outline type for geom_density as geom_area #3708

Merged

echasnovski mentioned this issue May 19, 2020

Add bounds argument to geom_density() #4013

Merged

teunbrand closed this as completed Jan 31, 2023

wbvguo mentioned this issue Jun 3, 2023

[FeatureRequest] geom_density_ridges for bounded data wilkelab/ggridges#85

Open

eliocamp mentioned this issue Oct 25, 2023

geom_violin() for bounded data #5493

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

`geom_density()` for bounded data #3387

`geom_density()` for bounded data #3387

echasnovski commented Jul 2, 2019

echasnovski commented Jul 3, 2019

MatthieuStigler commented Apr 30, 2020

echasnovski commented Apr 30, 2020

MatthieuStigler commented Apr 30, 2020

spspitze commented May 17, 2020 •

edited

Loading

echasnovski commented May 19, 2020

nstjhp commented Nov 25, 2020

teunbrand commented Jan 31, 2023

geom_density() for bounded data #3387

geom_density() for bounded data #3387

Comments

echasnovski commented Jul 2, 2019

echasnovski commented Jul 3, 2019

MatthieuStigler commented Apr 30, 2020

echasnovski commented Apr 30, 2020

MatthieuStigler commented Apr 30, 2020

spspitze commented May 17, 2020 • edited Loading

echasnovski commented May 19, 2020

nstjhp commented Nov 25, 2020

teunbrand commented Jan 31, 2023

`geom_density()` for bounded data #3387

`geom_density()` for bounded data #3387

spspitze commented May 17, 2020 •

edited

Loading