Default ECDF behavior is not empirical #1467

Closed
BrandanW opened this Issue Jan 4, 2016 · 3 comments

BrandanW commented Jan 4, 2016

I've found that when plotting an ECDF with ggplot, the two extra points at y = 0 and y = 1 are undesirable and frankly arbitrary. This is especially true for distributions with known support (e.g., strictly positive, or confined to a specific range). Is there a way to get around this default behavior, or are there any plans to change it?

Example:

ggplot(data.frame(x = exp(1:10)),
       aes(x)) +
  geom_line(stat = "ecdf")

rplot

felasa commented Jan 7, 2016

I believe stat_ecdf is meant to be used with geom_step:

ggplot(data.frame(x = exp(1:10)),  aes(x)) +
 stat_ecdf()

or

ggplot(data.frame(x = exp(1:10)),  aes(x)) +
 geom_step(stat="ecdf")

both produce this

image

BrandanW commented Jan 7, 2016

Even the step has two unneeded points, though, at about (-2000, 0) and (23700, 1). Wouldn't more rational endpoints be (min(x), 0) and (max(x), 1)? I don't see what the horizontal lines at top and bottom add.
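
To make the endpoint question concrete, here is a small base R sketch (no ggplot2 needed) that evaluates the ECDF of the example data just below, at, and just above the observed range. The variable names are illustrative only and do not come from the thread.

```r
# Example data from the issue: ten strictly positive values
x <- exp(1:10)
Fn <- ecdf(x)  # base R empirical CDF (stats package)

# Evaluate the step function around the edges of the observed range
below  <- Fn(min(x) - 1)  # any value below min(x)
at_min <- Fn(min(x))      # the first jump happens exactly at min(x)
at_max <- Fn(max(x))      # the last jump happens exactly at max(x)
above  <- Fn(max(x) + 1)  # any value above max(x)

print(c(below = below, at_min = at_min, at_max = at_max, above = above))
```

The function is 0 everywhere to the left of min(x) and 1 everywhere from max(x) onward, so the flat segments are part of the ECDF itself; how far they are drawn beyond the data is a plotting choice.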

felasa commented Jan 7, 2016

While I personally have no problem with that (it's not wrong, and base R does the same, although with a different visualization), I see what you mean.

This can certainly be addressed with a few changes. If the devs don't want it implemented, you can create a custom stat:

stat_myecdf <- function(mapping = NULL, data = NULL, geom = "step",
                      position = "identity", n = NULL, na.rm = FALSE,
                      show.legend = NA, inherit.aes = TRUE, direction="vh", ...) {
  layer(
    data = data,
    mapping = mapping,
    stat = StatMyecdf,
    geom = geom,
    position = position,
    show.legend = show.legend,
    inherit.aes = inherit.aes,
    params = list(
      n = n,
      na.rm = na.rm,
      direction=direction,
      ...
    )
  )
}

StatMyecdf <- ggproto("StatMyecdf", Stat,
                    compute_group = function(data, scales, n = NULL) {

                      # If n is NULL, use raw values; otherwise interpolate
                      if (is.null(n)) {
                        # I don't understand why, but this version needs the values sorted
                        xvals <- sort(unique(data$x))
                      } else {
                        xvals <- seq(min(data$x), max(data$x), length.out = n)
                      }

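                      y <- ecdf(data$x)(xvals)
                      # Prepend y = 0 and repeat the largest x so that the curve
                      # starts at (min(x), 0) and ends at (max(x), 1)
                      x1 <- max(xvals)
                      y0 <- 0
                      data.frame(x = c(xvals, x1), y = c(y0, y))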
                    },

                    default_aes = aes(y = ..y..),

                    required_aes = c("x")
)

then any of:

ggplot(data=data.frame(x = exp(1:10)), aes(x)) + geom_step(stat="myecdf")
ggplot(data=data.frame(x = exp(1:10)), aes(x)) + stat_myecdf()

image

@hadley hadley closed this in 2f270c0 Jan 26, 2016
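
For readers arriving later: current ggplot2 releases document a `pad` argument on `stat_ecdf()`. This thread does not say what commit 2f270c0 changed, so treating `pad` as the outcome of this issue is an assumption. A minimal sketch, assuming an installed ggplot2 version that provides `pad`:

```r
library(ggplot2)

df <- data.frame(x = exp(1:10))

# pad = FALSE asks stat_ecdf() not to add the extra points at -Inf and Inf
p <- ggplot(df, aes(x)) +
  stat_ecdf(pad = FALSE)

# Inspect the computed layer data rather than relying on the picture
d <- layer_data(p)
print(range(d$x))
```

With padding turned off, the computed x values should stay within the observed data, so no flat segments are drawn beyond min(x) and max(x).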

@lock lock bot locked as resolved and limited conversation to collaborators Jun 19, 2018