Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

New dodging algorithm for box plots #2196

Merged
merged 47 commits into from Jul 28, 2017
Merged

Conversation

karawoo
Copy link
Member

@karawoo karawoo commented Jul 6, 2017

Fixes #2143. This creates a new ggproto object for box plot positions, PositionBoxdodge, and a function pos_boxdodge() that can handle regular and variable-width boxes. collide() gets split into several functions. collide_setup() does the initial setup of the data and gets called by both collide() (which behaves the same as it has been) or collide_box().

@hadley this is slightly different than the algorithm we talked about because it doesn't use a for loop to find the non-overlapping groups. The x value in the data already seems to capture that information, so I think we can avoid the loop, but maybe I'm missing something...

With these changes, box plots look like this:

library("ggplot2")

ggplot(data = iris, aes(Species, Sepal.Length)) +
  geom_boxplot(aes(colour = Sepal.Width < 3.2))

ggplot(data = iris, aes(Species, Sepal.Length)) +
  geom_boxplot(aes(colour = Sepal.Width < 3.2), varwidth = TRUE)

@karawoo karawoo requested a review from hadley July 6, 2017 21:19
@hadley
Copy link
Member

hadley commented Jul 6, 2017

Can you include an example with continuous x? I think that's what I was thinking of for the for loop.

@hadley
Copy link
Member

hadley commented Jul 6, 2017

It also seems like we need a little padding between the individual elements. Could we make that a parameter?

@karawoo
Copy link
Member Author

karawoo commented Jul 6, 2017

Ohhh continuous x. Yeah you're right, this doesn't work well for that.

library("ggplot2")

## Without varwidth = TRUE it's ok aside from the padding issue
ggplot(diamonds, aes(y = price)) +
  geom_boxplot(aes(x = cut_width(carat, 0.5), fill = depth < 65))

## With the changes from this PR and varwidth = TRUE
ggplot(diamonds, aes(y = price)) +
  geom_boxplot(aes(x = cut_width(carat, 0.5), fill = depth < 65), varwidth = TRUE)

Here's how the second plot looks with the current dev version of ggplot.

ggplot(diamonds, aes(y = price)) +
  geom_boxplot(aes(x = cut_width(carat, 0.5), fill = depth < 65), varwidth = TRUE)
#> Warning: position_dodge requires non-overlapping x intervals

for loop it is, then!

@karawoo
Copy link
Member Author

karawoo commented Jul 9, 2017

Oh, and there is a larger problem with this: each pair of boxes is getting scaled separately to the group width, which makes the varying widths from varwidth meaningless.

@mcol
Copy link
Contributor

mcol commented Jul 10, 2017

Note that the order of the bars in the second example of the OP is different from the rest (I've confirmed it with a checkout of your current position-dodge branch).

@karawoo
Copy link
Member Author

karawoo commented Jul 11, 2017

Thanks for pointing that out, @mcol. Right now the boxes get placed in the order in which they appear in the data passed to pos_boxdodge() and I need to change that to enforce consistency across groups of boxes.

@karawoo
Copy link
Member Author

karawoo commented Jul 11, 2017

Also, my example above doesn't really have a continuous x variable since cut_width() converts it to a factor. Here is a better example for testing with (that is still broken):

set.seed(582)
dat <- data.frame(x = sample(4:5, size = 20, replace = TRUE),
                  y = rnorm(20),
                  class = sample(c("a", "b"), size = 20, replace = TRUE))

ggplot(dat, aes(x = x, y = y)) +
  geom_boxplot(aes(group = interaction(x, class), fill = class), varwidth = TRUE)

@karawoo
Copy link
Member Author

karawoo commented Jul 14, 2017

Alright, here is the new box plot algorithm. Below are a bunch of examples of how it looks in practice with continuous and discrete x variables and various combinations of varwidth = TRUE, varwidth = FALSE, preserve = "single", and preserve = "total". Right now preserve = "single" is the default for all box plots, but I could change that so that it's only the default when varwidth = TRUE. This definitely needs some more tests, which I will add soon.

library("ggplot2")

## From the original example:
ggplot(data = iris, aes(Species, Sepal.Length)) +
  geom_boxplot(aes(colour = Sepal.Width < 3.2))

ggplot(data = iris, aes(Species, Sepal.Length)) +
  geom_boxplot(aes(colour = Sepal.Width < 3.2), varwidth = TRUE)

## Diamonds data -- padding between boxes is less noticeable
ggplot(diamonds, aes(y = price)) +
  geom_boxplot(aes(x = cut_width(carat, 0.5), fill = depth < 65), varwidth = TRUE)

 
## A plot with truly continuous x
set.seed(582)
dat <- data.frame(x = sample(4:5, size = 20, replace = TRUE),
                  y = rnorm(20),
                  class = sample(c("a", "b"), size = 20, replace = TRUE))

ggplot(dat, aes(x = x, y = y)) +
  geom_boxplot(aes(group = interaction(x, class), fill = class))

## Truly continuous x and varwidth = TRUE
ggplot(dat, aes(x = x, y = y)) +
  geom_boxplot(aes(group = interaction(x, class), fill = class), varwidth = TRUE)

## Preserve total width
ggplot(mtcars, aes(factor(cyl), y = mpg, fill = factor(vs))) +
  geom_boxplot(position = position_boxdodge(preserve = "total"))

## Preserve individual width with varwidth = TRUE
ggplot(mtcars, aes(factor(cyl), y = mpg, fill = factor(vs))) +
  geom_boxplot(position = position_boxdodge(preserve = "single"), varwidth = TRUE)

## preserve = "total" is incompatible with varwidth = TRUE
ggplot(mtcars, aes(factor(cyl), y = mpg, fill = factor(vs))) +
  geom_boxplot(position = position_boxdodge(preserve = "total"), varwidth = TRUE)
#> Warning: Can't preserve total widths when varwidth = TRUE.

Copy link
Member

@hadley hadley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Results look great!

@@ -0,0 +1,116 @@
#' Position dodge for box plots
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like to give this a name that reflects that it works for any geom with variable widths - i.e. it would also work for geom_rect().

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

position_vardodge()?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or maybe position_flexdodge(), since it's a more flexible dodge that can handle variable width boxes and arbitrary rectangles?

#' @export
#' @examples
#' ggplot(data = iris, aes(Species, Sepal.Length)) +
#' geom_boxplot(aes(colour = Sepal.Width < 3.2))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing indent

}

# xid represents groups of boxes that share the same position
df$xid <- match(df$x, sort(unique(df$x)))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh this is why it won't work with geom_rect(). It's only a few lines of code, so I think it's worth using the for-loop that I originally proposed

df$xmax <- df$x + (df$new_width / 2)

# Find the total width of each group of boxes
group_sizes <- plyr::ddply(df, "xid", plyr::summarize, size = sum(new_width))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could this be replaced by a tapply()? I'd prefer to not use plyr in new code, since one day I'd like to eliminate the dependency.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh yeah sure. aggregate() might be even better since it'll return a data frame.

}

# x values get moved to between xmin and xmax
df$x <- rowMeans(df[, c("xmin", "xmax")])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think df$x <- (df$xmin + df$xmax) / 2 would be clearer

overlaps <- vector(length = nrow(d) - 1)
for (i in 2:nrow(d)) {
if (d$xmin[i] < d$xmax[i - 1]) {
print(i)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need to remove this print()


overlaps <- vector(length = nrow(d) - 1)
for (i in 2:nrow(d)) {
if (d$xmin[i] < d$xmax[i - 1]) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This for loop could become (with a small modification) a separate group_overlapping function, and you could use above.

@hadley
Copy link
Member

hadley commented Jul 14, 2017

The default padding looks slightly too big to my eyes — maybe try 0.05.

It would also be handy to show a couple of examples using with other geoms (i.e. geom_bar())

@karawoo
Copy link
Member Author

karawoo commented Jul 20, 2017

I did some more research and it'll take a bit more work before position_dodge2() can be used in all the places dodge is currently used. Right now it doesn't work well with dot plots, violins, or error bars.

@hadley
Copy link
Member

hadley commented Jul 21, 2017

Ok. Let's make it the default for boxplots and bars and then take a look at the others later.

@karawoo
Copy link
Member Author

karawoo commented Jul 25, 2017

I've made some slight changes to hopefully make this as easy to use for both boxes and bars as possible. position_dodge2() now defaults to having 0 padding, making it work nicely with bars -- you just have to add position = "dodge2" to override the default stacking position. geom_boxplot() uses a default position of position_dodge2(padding = 0.05), so boxes still get padding by default.

Something has gone wrong again with the ordering of the boxes/bars compared to the legend, I'll track that down and fix it.

ggplot(data = iris, aes(Species, Sepal.Length)) +
  geom_boxplot(aes(colour = Sepal.Width < 3.2), varwidth = TRUE)

ggplot(mtcars, aes(factor(cyl), fill = factor(vs))) +
  geom_bar(position = "dodge2")

@hadley
Copy link
Member

hadley commented Jul 25, 2017

This is looking great! (Although now padding of 0.05 looks too small. Maybe switch back to 0.10?)

@karawoo
Copy link
Member Author

karawoo commented Jul 25, 2017

Soooo close. I am rethinking the defaults that I changed above. Right now this adds padding between the boxes because geom_boxplot's default position is position_dodge2(padding = 0.1):

ggplot(mtcars, aes(factor(cyl), y = mpg, fill = factor(vs))) +
  geom_boxplot()

But if someone wants to change the preserve argument, they're likely going to do:

ggplot(mtcars, aes(factor(cyl), y = mpg, fill = factor(vs))) +
  geom_boxplot(position = position_dodge2(preserve = "total"))

which results in no padding (they'd need to also add padding = 0.1). So maybe it's better to have position_dodge2() default to adding padding, which will be consistent for box plots even if other position parameters get changed. And for bars, people will have to override that default if they don't want space between them.

@hadley
Copy link
Member

hadley commented Jul 26, 2017

Yeah, I think that's a good idea. I rather like the padding between bars too.

Copy link
Member

@hadley hadley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Feel free to merge once you've made the news and doc tweaks.

NEWS.md Outdated
@@ -1,5 +1,8 @@
# ggplot2 2.2.1.9000

* Box plot position is now controlled by `position_dodge2()` (@karawoo,
#2143).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And bar. And you should add a brief descrption.

@@ -0,0 +1,154 @@
#' Alternate method for dodging overlapping objects
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe it's worth documenting together with position_dodge()?

@lock
Copy link

lock bot commented Jan 18, 2019

This old issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with reprex) and link to this issue. https://reprex.tidyverse.org/

@lock lock bot locked and limited conversation to collaborators Jan 18, 2019
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants