New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NAs should be excluded from the level of discrete X-axis by default #1584

Closed
yutannihilation opened this Issue Mar 14, 2016 · 13 comments

Comments

Projects
None yet
3 participants
@yutannihilation
Member

yutannihilation commented Mar 14, 2016

When the variable for x is character, NAs are not treated as a break of X axis. But when the variable is factor, NAs are included.

library(ggplot2)

set.seed(1)
d <- c("A", "B", NA)
x <- sample(d, size = 100, replace = TRUE, prob = c(0.7, 0.2, 0.1))
y <- rnorm(100)
df <- data.frame(x, y)

ggplot(df) + geom_boxplot(aes(x, y), na.rm = TRUE)

ggplot(df) + geom_boxplot(aes(as.character(x), y), na.rm = TRUE)

na_included na_not_included

I feel this behaviour is inconsistent. NAs should be included only when NA is intentionally included in the level of the factor variable.

levels(df$x)
[1] "A" "B"

I guess this is simply because scales:::clevels() treats them differently. But, I'm wondering if this is by design... Why do we have to force NAs included, while we can include NA in the factor level by ourselves?

set.seed(1)
d <- c("A", "B", NA)
x <- sample(d, size = 100, replace = TRUE, prob = c(0.7, 0.2, 0.1))

scales:::clevels(x, drop = FALSE)
#> [1] "A" "B"

scales:::clevels(as.factor(x), drop = FALSE)
#> [1] "A" "B" NA 

Is it possible to eliminate NA from factor X-axis by default? (But I'm afraid this suggestion is too late and many people may already rely on this behaviour)

@yutannihilation yutannihilation changed the title from NAs are inconsistently treated when X-axis is a character/factor variable to NAs should be excluded from the level of discrete X-axis by default Mar 14, 2016

@hadley

This comment has been minimized.

Member

hadley commented Jul 28, 2016

They absolutely should be included by default. The bug is that they aren't in the character case, and that na.rm doesn't affect their appearance.

@hadley hadley added bug ready labels Jul 28, 2016

@hadley

This comment has been minimized.

Member

hadley commented Jul 28, 2016

Here's a minimal reprex of the 2x2 combinations:

df <- data.frame(
  x = c("a", "b", NA),
  y = 1:3,
  stringsAsFactors = FALSE
)

# OK
ggplot(df, aes(x, y)) + geom_point(na.rm = TRUE)
ggplot(df, aes(factor(x), y)) + geom_point()

# Incorrect: NA can be plotted
ggplot(df, aes(x, y)) + geom_point()
# Incorrect: NA should be removed
ggplot(df, aes(factor(x), y)) + geom_point(na.rm = TRUE)

@thomasp85 thomasp85 self-assigned this Aug 12, 2016

@thomasp85

This comment has been minimized.

Member

thomasp85 commented Aug 23, 2016

As per the docs na.rm should not alter the plot or the data but merely controls whether a warning is issued. Your reprex suggest a different (but in my opinion more sensible) interpretation entirely...

The reason for the discrepancy is that NA is just another level in factors and at the time the data arrives at geom point it is encoded with the integer mapping, whereas characters are still characters and NA thus persists.

@hadley Can you comment on the true purpose of na.rm before I fix this..?

@hadley

This comment has been minimized.

Member

hadley commented Aug 23, 2016

The goal should always be to display missing values on the plot, where possible. That's not possible for continuous scales, so the best we can do is display a warning. It is possible for categorical scales, so na.rm should control the appearance.

@thomasp85

This comment has been minimized.

Member

thomasp85 commented Aug 23, 2016

In that case we need to change the doc wording a bit...

na.rm:
If FALSE (the default), removes missing values with a warning. If TRUE silently removes missing values.

@hadley

This comment has been minimized.

Member

hadley commented Aug 23, 2016

Something like: If TRUE, silently removes missing values. If FALSE (the default), will remove missing continuous values with a warning; will show missing categorical values.

@hadley hadley added this to the v2.2.0 milestone Sep 23, 2016

@hadley

This comment has been minimized.

Member

hadley commented Sep 23, 2016

Is it clear what's needed?

@thomasp85

This comment has been minimized.

Member

thomasp85 commented Sep 26, 2016

Yeah - I just need to find the best way to fix this

@thomasp85

This comment has been minimized.

Member

thomasp85 commented Sep 26, 2016

Hmm this problem really sits on an intersection between ranges, scales, and geoms, in terms of where responsibility should be placed - lets talk about it in person to find out how it can be solved best...

@hadley

This comment has been minimized.

Member

hadley commented Sep 27, 2016

I think we should fix this for character vector by fixing scales::clevels() (which is a very simple change).

We then need to make sure there's some way to actually drop these NA values, because the na.rm to the layers can not work, because by the time the layer sees the data the missingness has been removed (as it's be converted to an integer position). This means we need a na.rm argument to scale_x_discrete() and scale_y_discrete(), and we need to clarify that na.rm argument to the layers only affects continuous missing values.

@hadley hadley assigned hadley and unassigned thomasp85 Oct 6, 2016

@hadley

This comment has been minimized.

Member

hadley commented Oct 6, 2016

I'll handle this one — I think it also needs some minor changes to non-positional discrete scales for consistency.

@hadley hadley closed this in db2cbba Oct 6, 2016

@hadley

This comment has been minimized.

Member

hadley commented Oct 6, 2016

@yutannihilation I know I didn't implement what you requested, but hopefully now the principles are obvious, and they're consistently implemented for the three types of discrete missing value and both types of scale.

@yutannihilation

This comment has been minimized.

Member

yutannihilation commented Oct 8, 2016

@hadley Thanks for your great fix! (and sorry for being quiet as I couldn't follow this complicated discussion...) I love this consistency 👍

@lock lock bot locked as resolved and limited conversation to collaborators Jun 19, 2018

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.