NAs in colored scatter plots are misleading #410

Closed
ifellows opened this Issue Feb 21, 2012 · 5 comments

Consider:

library(ggplot2)

dat <- data.frame(a = 1:10, b = 1:10, c = c(1:9, NA))
ggplot() +
  geom_point(aes(x = a, y = b, colour = c), data = dat) +
  scale_colour_gradient(guide = guide_legend(), low = '#0', high = '#e3e1e2')

There is no indication in the legend of what NA values look like, and even if there were, the reader would not be able to distinguish the NA color from the color scale. It seems to me that NAs need either a special plotting symbol (perhaps an open point with "NA" written inside it), or they should be dropped with a warning.
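As a rough illustration of the "special plotting symbol" suggestion, here is a minimal sketch that can be done by hand today, assuming the NA rows are split out into their own layer. The names `missing_c` and `p_marked`, the open-circle shape, and the red colour are illustrative choices, not anything ggplot2 does automatically.

```r
library(ggplot2)

dat <- data.frame(a = 1:10, b = 1:10, c = c(1:9, NA))

# Flag the rows whose colour variable is missing.
missing_c <- is.na(dat$c)

p_marked <- ggplot(mapping = aes(x = a, y = b)) +
  # Rows with a value: coloured by c as usual.
  geom_point(aes(colour = c), data = dat[!missing_c, ]) +
  # Rows with NA: open circles (shape 1) in a fixed colour, drawn in a
  # separate layer so they cannot be confused with a point on the gradient.
  geom_point(data = dat[missing_c, ], shape = 1, size = 4, colour = "red")
```

Printing `p_marked` draws the plot. Note that this workaround does not add an NA entry to the legend, so it only addresses the symbol half of the problem.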

Owner

hadley commented Feb 21, 2012

But they do get a different colour...

Not if the scale includes that color (as in the example). Even in the default gradient, the default NA color is close enough to the scale to cause confusion. The NA color can of course be changed explicitly.
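For reference, a minimal sketch of changing the NA color explicitly. It assumes a ggplot2 version in which the continuous colour scales accept an `na.value` argument; the endpoint colours and the choice of red are purely illustrative.

```r
library(ggplot2)

dat <- data.frame(a = 1:10, b = 1:10, c = c(1:9, NA))

# Pick an NA colour that lies well outside the black-to-grey gradient.
na_scale <- scale_colour_gradient(low = "black", high = "grey90",
                                  na.value = "red")

p <- ggplot(dat, aes(x = a, y = b, colour = c)) +
  geom_point() +
  na_scale
```

With this, the point at `a = 10` is drawn in red, but the legend still has no entry explaining what red means.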

Collaborator

kohske commented Feb 22, 2012

Actually ggplot2 doesn't know what NA means.

In my view, ggplot2 should display a warning message but still try to draw using na.default.
If users need to set the color/symbol for NA, they can do so by changing na.default.
Maybe the legend should have an option such as display.NA. This is easy now.

When generating the legend, handling NA is really annoying.
Just a note: IIRC, ggplot2 (and scales) currently use NA internally for, e.g., out-of-range values.
That will cause some problems.

Owner

hadley commented Jun 8, 2012

I think the best resolution for this would be for the legend to display an entry for NA when there are any on the graph - the default colours might make it difficult to distinguish the missing value from other values, but at least you'd know it was there and could remedy the problem.

The challenge, as @kohske mentioned, is that there are two meanings for NA: an explicit NA from the original data (which we want to preserve) or an NA introduced because the value is outside the range of the scale (we don't want to show that on the plot, but it should trigger an error message). One option, which is not particularly elegant, would be to use NA and NaN to distinguish between the two states. Otherwise, I could try creating a new vector type that could have multiple types of missing values (a hybrid continuous/discrete variable).
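To make the NA/NaN idea concrete, here is a small base-R sketch. The helper `censor_as_nan()` is hypothetical and is not how ggplot2 or scales handle out-of-range values; it only shows that the two states can be told apart afterwards.

```r
# Hypothetical helper: replace values outside `limits` with NaN so they
# stay distinguishable from NAs that were already present in the data.
censor_as_nan <- function(x, limits) {
  x <- as.numeric(x)
  outside <- !is.na(x) & (x < limits[1] | x > limits[2])
  x[outside] <- NaN
  x
}

values   <- c(1, 5, NA, 12)
censored <- censor_as_nan(values, limits = c(0, 10))

# is.na() is TRUE for both NA and NaN; is.nan() is TRUE only for NaN.
explicit_missing <- is.na(censored) & !is.nan(censored)  # NA from the data
out_of_range     <- is.nan(censored)                     # introduced by censoring
```

Part of what makes this inelegant is that it only works for doubles (integer and character vectors have no NaN), and R does not guarantee whether arithmetic that mixes NA and NaN returns one or the other.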

Owner

hadley commented Feb 24, 2014

This sounds like a great feature, but unfortunately we don't currently have the development bandwidth to support it. If you'd like to submit a pull request that implements this feature, please follow the instructions in the development vignette.

hadley closed this Feb 24, 2014
