Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Spreading NA keys makes NA column names #68

Closed
gregmacfarlane opened this issue Mar 18, 2015 · 11 comments

Comments

@gregmacfarlane
Copy link

commented Mar 18, 2015

Trying to create a cross tabulation using tally() combined with spread() works nicely, unless there are missing values in the key columns.

mtcars %>% 
  group_by(cyl, gear) %>%
  tally() %>%
  spread(gear, n, fill = 0)
Source: local data frame [3 x 4]

  cyl  3 4 5
1   4  1 8 2
2   6  2 4 1
3   8 12 0 2

mtcars %>%
  mutate(
    cyl = ifelse(cyl > 6, NA, cyl),
    gear = ifelse(gear > 4, NA, gear)
  ) %>%
  group_by(cyl, gear) %>%
  tally()
Source: local data frame [8 x 3]
Groups: cyl

  cyl gear  n
1   4    3  1
2   4    4  8
3   4   NA  2
4   6    3  2
5   6    4  4
6   6   NA  1
7  NA    3 12
8  NA   NA  2

# DITTO # %>%
  spread(gear, n)
Error in if (any(names2(x) == "")) { : 
  missing value where TRUE/FALSE needed

It seems like a good solution would be to either treat NA values as another key or to drop them from the resulting table, controlled by an option switch. Like useNA = "always" in the base table() function, but with better syntax.

@zross

This comment has been minimized.

Copy link

commented Apr 1, 2015

Thanks for adding this issue. I second it. At a minimum a warning "There cannot be NA values in the key variable" would be useful. As always, thanks for this very useful package Hadley.

@hadley

This comment has been minimized.

Copy link
Member

commented May 13, 2015

Simpler example:

data.frame(x = c(1, 2, NA), y = 1, z = c(1, 2, 3)) %>%
  tbl_df() %>%
  spread(x, z)

Note that it's actually a dplyr problem since this works:

data.frame(x = c(1, 2, NA), y = 1, z = c(1, 2, 3)) %>%
  spread(x, z)

Should fix by turning NA column names into something valid <NA> ?

@zross

This comment has been minimized.

Copy link

commented May 14, 2015

I think that makes sense. Or an argument that allows the user to decide along the lines of useNA=c(TRUE, FALSE) and if TRUE then turn NA to something valid.

@jennybc

This comment has been minimized.

Copy link
Member

commented May 19, 2015

I just tripped on this same thing. Before I found this issue, I had made my own example, so will include here anyway. As for what to name the columns, it seems like an option to prepend the key name would be useful here and more generally.

jtxt <- "
name, grp,    sex, age
 joe,   1,   male,  30
 jan,   1,     NA,  43
 jen,   1, female,  44
 pat,   2,     NA,  24
 pam,   2, female,  63
 peg,   2, female,  50
"
jdat <- read.csv(text = jtxt, stringsAsFactors = FALSE, strip.white = TRUE)

Cross-tabulating and spreading with (sex, grp) works:

> jdat %>%
+ count(sex, grp) %>%
+ spread(grp, n, fill = 0)
Source: local data frame [3 x 3]

     sex 1 2
1 female 1 2
2   male 1 0
3     NA 1 1

But going the other way round, (grp, sex), does not:

> jdat %>%
+ count(grp, sex) %>%
+ spread(sex, n, fill = 0)
Error in if (any(names2(x) == "")) { : 
  missing value where TRUE/FALSE needed

To illustrate my proposal re column names, here's possible output:

grp sex_female sex_male sex_NA
  1          1        1      1
  2          2        0      1

The prepending would fix up the value of NA as a column name and would also help when the values are cryptic, e.g. naked integers like cyl and gear from mtcars.

@gregmacfarlane

This comment has been minimized.

Copy link
Author

commented May 19, 2015

@jennybc I like the idea of prepending. It will resolve the problem where kable(spread()) needs some in-text explanation as to what it is showing.

grp          1        2      3
  1          1        1      1
  2          2        0      1
@hadley

This comment has been minimized.

Copy link
Member

commented May 19, 2015

@jennybc Seems like most of the time you'll want the key name prepended if the value is non-character, but most of the time you won't want it prepended if it is a character. So maybe something like include_key = !is.character(value) ?

@jennybc

This comment has been minimized.

Copy link
Member

commented May 19, 2015

Yes, @hadley, I think you're right about when prepending the key is most desirable. That, together with the ability to have a variable/column named <NA>, would represent a nice gain in functionality for spread().

@hadley

This comment has been minimized.

Copy link
Member

commented Dec 30, 2015

This no longer fails for me, presumably because dplyr does a better job of handling columns names that are missing.

I'm a bit leery of automatically changing the column names, because traditionally that's been done in a separate step (which is long but makes things more explicit, and sticks to the principle of each function doing one thing).

@hadley hadley changed the title `spread` with missing keys returns error `spread` with missing puts NA in column names Dec 30, 2015

@hadley hadley changed the title `spread` with missing puts NA in column names Spreading missing keys makes missing column names Dec 30, 2015

@hadley hadley changed the title Spreading missing keys makes missing column names Spreading NA keys makes NA column names Dec 30, 2015

@hadley hadley closed this in 559b732 May 22, 2016

@hadley

This comment has been minimized.

Copy link
Member

commented May 22, 2016

Implemented with sep argument. If anyone strongly dislikes this API, please speak up ASAP!

data_frame(x = c(1, NA), y = 1:2) %>% 
  spread(x, y) %>% 
  names()
#> [1] "1"    "<NA>"

data_frame(x = c(1, NA), y = 1:2) %>% 
  spread(x, y, sep = "_") %>% 
  names()
#> [1] "x_1"  "x_NA"
@zross

This comment has been minimized.

Copy link

commented May 23, 2016

A very useful addition!

@sarahend

This comment has been minimized.

Copy link

commented Feb 2, 2018

I think the naming of columns would be more intuitive if the default name when sep is specified was "<value_name><key_value>" to edit the above example:

jtxt <- "
name, grp,    sex, age
 joe,   1,   male,  30
 jan,   1,     NA,  43
 jen,   1, female,  44
 joe,   2,     NA,  24
 jan,   2, female,  63
 jen,   2, female,  50
"
jdat <- read.csv(text = jtxt, stringsAsFactors = FALSE, strip.white = TRUE)

If we spread this:

spread(jdat,grp,sex,sep = "_")
 name age  grp_1  grp_2
1  jan  43   <NA>   <NA>
2  jan  63   <NA> female
3  jen  44 female   <NA>
4  jen  50   <NA> female
5  joe  24   <NA>   <NA>
6  joe  30   male   <NA>

grp_1 is not a very descriptive title where as sex_1 would be more helpful.

It would also help make spread() more useful if you could spread more than one variable. In this example the age variable has been separated from group and it not longer makes sense. Alternatively, if there was an option ie keep_key=TRUE which would allow you to keep the grp column intact then you could spread the age column separately.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
5 participants
You can’t perform that action at this time.