Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NAs should be excluded from the level of discrete X-axis by default #1584

Closed
yutannihilation opened this issue Mar 14, 2016 · 13 comments
Closed

NAs should be excluded from the level of discrete X-axis by default #1584

yutannihilation opened this issue Mar 14, 2016 · 13 comments
Assignees
Labels
bug
Milestone

Comments

@yutannihilation
Copy link
Member

@yutannihilation yutannihilation commented Mar 14, 2016

When the variable for x is character, NAs are not treated as a break of X axis. But when the variable is factor, NAs are included.

library(ggplot2)

set.seed(1)
d <- c("A", "B", NA)
x <- sample(d, size = 100, replace = TRUE, prob = c(0.7, 0.2, 0.1))
y <- rnorm(100)
df <- data.frame(x, y)

ggplot(df) + geom_boxplot(aes(x, y), na.rm = TRUE)

ggplot(df) + geom_boxplot(aes(as.character(x), y), na.rm = TRUE)

na_included na_not_included

I feel this behaviour is inconsistent. NAs should be included only when NA is intentionally included in the level of the factor variable.

levels(df$x)
[1] "A" "B"

I guess this is simply because scales:::clevels() treats them differently. But, I'm wondering if this is by design... Why do we have to force NAs included, while we can include NA in the factor level by ourselves?

set.seed(1)
d <- c("A", "B", NA)
x <- sample(d, size = 100, replace = TRUE, prob = c(0.7, 0.2, 0.1))

scales:::clevels(x, drop = FALSE)
#> [1] "A" "B"

scales:::clevels(as.factor(x), drop = FALSE)
#> [1] "A" "B" NA 

Is it possible to eliminate NA from factor X-axis by default? (But I'm afraid this suggestion is too late and many people may already rely on this behaviour)

@yutannihilation yutannihilation changed the title NAs are inconsistently treated when X-axis is a character/factor variable NAs should be excluded from the level of discrete X-axis by default Mar 14, 2016
@hadley
Copy link
Member

@hadley hadley commented Jul 28, 2016

They absolutely should be included by default. The bug is that they aren't in the character case, and that na.rm doesn't affect their appearance.

@hadley hadley added bug ready labels Jul 28, 2016
@hadley
Copy link
Member

@hadley hadley commented Jul 28, 2016

Here's a minimal reprex of the 2x2 combinations:

df <- data.frame(
  x = c("a", "b", NA),
  y = 1:3,
  stringsAsFactors = FALSE
)

# OK
ggplot(df, aes(x, y)) + geom_point(na.rm = TRUE)
ggplot(df, aes(factor(x), y)) + geom_point()

# Incorrect: NA can be plotted
ggplot(df, aes(x, y)) + geom_point()
# Incorrect: NA should be removed
ggplot(df, aes(factor(x), y)) + geom_point(na.rm = TRUE)
@thomasp85 thomasp85 self-assigned this Aug 12, 2016
@thomasp85
Copy link
Member

@thomasp85 thomasp85 commented Aug 23, 2016

As per the docs na.rm should not alter the plot or the data but merely controls whether a warning is issued. Your reprex suggest a different (but in my opinion more sensible) interpretation entirely...

The reason for the discrepancy is that NA is just another level in factors and at the time the data arrives at geom point it is encoded with the integer mapping, whereas characters are still characters and NA thus persists.

@hadley Can you comment on the true purpose of na.rm before I fix this..?

@hadley
Copy link
Member

@hadley hadley commented Aug 23, 2016

The goal should always be to display missing values on the plot, where possible. That's not possible for continuous scales, so the best we can do is display a warning. It is possible for categorical scales, so na.rm should control the appearance.

@thomasp85
Copy link
Member

@thomasp85 thomasp85 commented Aug 23, 2016

In that case we need to change the doc wording a bit...

na.rm:
If FALSE (the default), removes missing values with a warning. If TRUE silently removes missing values.

@hadley
Copy link
Member

@hadley hadley commented Aug 23, 2016

Something like: If TRUE, silently removes missing values. If FALSE (the default), will remove missing continuous values with a warning; will show missing categorical values.

@hadley hadley added this to the v2.2.0 milestone Sep 23, 2016
@hadley
Copy link
Member

@hadley hadley commented Sep 23, 2016

Is it clear what's needed?

@thomasp85
Copy link
Member

@thomasp85 thomasp85 commented Sep 26, 2016

Yeah - I just need to find the best way to fix this

@thomasp85
Copy link
Member

@thomasp85 thomasp85 commented Sep 26, 2016

Hmm this problem really sits on an intersection between ranges, scales, and geoms, in terms of where responsibility should be placed - lets talk about it in person to find out how it can be solved best...

@hadley
Copy link
Member

@hadley hadley commented Sep 27, 2016

I think we should fix this for character vector by fixing scales::clevels() (which is a very simple change).

We then need to make sure there's some way to actually drop these NA values, because the na.rm to the layers can not work, because by the time the layer sees the data the missingness has been removed (as it's be converted to an integer position). This means we need a na.rm argument to scale_x_discrete() and scale_y_discrete(), and we need to clarify that na.rm argument to the layers only affects continuous missing values.

@hadley hadley assigned hadley and unassigned thomasp85 Oct 6, 2016
@hadley
Copy link
Member

@hadley hadley commented Oct 6, 2016

I'll handle this one — I think it also needs some minor changes to non-positional discrete scales for consistency.

@hadley hadley closed this in db2cbba Oct 6, 2016
@hadley
Copy link
Member

@hadley hadley commented Oct 6, 2016

@yutannihilation I know I didn't implement what you requested, but hopefully now the principles are obvious, and they're consistently implemented for the three types of discrete missing value and both types of scale.

@yutannihilation
Copy link
Member Author

@yutannihilation yutannihilation commented Oct 8, 2016

@hadley Thanks for your great fix! (and sorry for being quiet as I couldn't follow this complicated discussion...) I love this consistency 👍

@lock lock bot locked as resolved and limited conversation to collaborators Jun 19, 2018
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
3 participants