Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Spreading NA keys makes NA column names #68

Closed
gregmacfarlane opened this issue Mar 18, 2015 · 11 comments
Closed

Spreading NA keys makes NA column names #68

gregmacfarlane opened this issue Mar 18, 2015 · 11 comments

Comments

@gregmacfarlane
Copy link

@gregmacfarlane gregmacfarlane commented Mar 18, 2015

Trying to create a cross tabulation using tally() combined with spread() works nicely, unless there are missing values in the key columns.

mtcars %>% 
  group_by(cyl, gear) %>%
  tally() %>%
  spread(gear, n, fill = 0)
Source: local data frame [3 x 4]

  cyl  3 4 5
1   4  1 8 2
2   6  2 4 1
3   8 12 0 2

mtcars %>%
  mutate(
    cyl = ifelse(cyl > 6, NA, cyl),
    gear = ifelse(gear > 4, NA, gear)
  ) %>%
  group_by(cyl, gear) %>%
  tally()
Source: local data frame [8 x 3]
Groups: cyl

  cyl gear  n
1   4    3  1
2   4    4  8
3   4   NA  2
4   6    3  2
5   6    4  4
6   6   NA  1
7  NA    3 12
8  NA   NA  2

# DITTO # %>%
  spread(gear, n)
Error in if (any(names2(x) == "")) { : 
  missing value where TRUE/FALSE needed

It seems like a good solution would be to either treat NA values as another key or to drop them from the resulting table, controlled by an option switch. Like useNA = "always" in the base table() function, but with better syntax.

@zross
Copy link

@zross zross commented Apr 1, 2015

Thanks for adding this issue. I second it. At a minimum a warning "There cannot be NA values in the key variable" would be useful. As always, thanks for this very useful package Hadley.

Loading

@hadley
Copy link
Member

@hadley hadley commented May 13, 2015

Simpler example:

data.frame(x = c(1, 2, NA), y = 1, z = c(1, 2, 3)) %>%
  tbl_df() %>%
  spread(x, z)

Note that it's actually a dplyr problem since this works:

data.frame(x = c(1, 2, NA), y = 1, z = c(1, 2, 3)) %>%
  spread(x, z)

Should fix by turning NA column names into something valid <NA> ?

Loading

@zross
Copy link

@zross zross commented May 14, 2015

I think that makes sense. Or an argument that allows the user to decide along the lines of useNA=c(TRUE, FALSE) and if TRUE then turn NA to something valid.

Loading

@jennybc
Copy link
Member

@jennybc jennybc commented May 19, 2015

I just tripped on this same thing. Before I found this issue, I had made my own example, so will include here anyway. As for what to name the columns, it seems like an option to prepend the key name would be useful here and more generally.

jtxt <- "
name, grp,    sex, age
 joe,   1,   male,  30
 jan,   1,     NA,  43
 jen,   1, female,  44
 pat,   2,     NA,  24
 pam,   2, female,  63
 peg,   2, female,  50
"
jdat <- read.csv(text = jtxt, stringsAsFactors = FALSE, strip.white = TRUE)

Cross-tabulating and spreading with (sex, grp) works:

> jdat %>%
+ count(sex, grp) %>%
+ spread(grp, n, fill = 0)
Source: local data frame [3 x 3]

     sex 1 2
1 female 1 2
2   male 1 0
3     NA 1 1

But going the other way round, (grp, sex), does not:

> jdat %>%
+ count(grp, sex) %>%
+ spread(sex, n, fill = 0)
Error in if (any(names2(x) == "")) { : 
  missing value where TRUE/FALSE needed

To illustrate my proposal re column names, here's possible output:

grp sex_female sex_male sex_NA
  1          1        1      1
  2          2        0      1

The prepending would fix up the value of NA as a column name and would also help when the values are cryptic, e.g. naked integers like cyl and gear from mtcars.

Loading

@gregmacfarlane
Copy link
Author

@gregmacfarlane gregmacfarlane commented May 19, 2015

@jennybc I like the idea of prepending. It will resolve the problem where kable(spread()) needs some in-text explanation as to what it is showing.

grp          1        2      3
  1          1        1      1
  2          2        0      1

Loading

@hadley
Copy link
Member

@hadley hadley commented May 19, 2015

@jennybc Seems like most of the time you'll want the key name prepended if the value is non-character, but most of the time you won't want it prepended if it is a character. So maybe something like include_key = !is.character(value) ?

Loading

@jennybc
Copy link
Member

@jennybc jennybc commented May 19, 2015

Yes, @hadley, I think you're right about when prepending the key is most desirable. That, together with the ability to have a variable/column named <NA>, would represent a nice gain in functionality for spread().

Loading

@hadley
Copy link
Member

@hadley hadley commented Dec 30, 2015

This no longer fails for me, presumably because dplyr does a better job of handling columns names that are missing.

I'm a bit leery of automatically changing the column names, because traditionally that's been done in a separate step (which is long but makes things more explicit, and sticks to the principle of each function doing one thing).

Loading

@hadley hadley changed the title spread with missing keys returns error spread with missing puts NA in column names Dec 30, 2015
@hadley hadley changed the title spread with missing puts NA in column names Spreading missing keys makes missing column names Dec 30, 2015
@hadley hadley changed the title Spreading missing keys makes missing column names Spreading NA keys makes NA column names Dec 30, 2015
@hadley hadley closed this in 559b732 May 22, 2016
@hadley
Copy link
Member

@hadley hadley commented May 22, 2016

Implemented with sep argument. If anyone strongly dislikes this API, please speak up ASAP!

data_frame(x = c(1, NA), y = 1:2) %>% 
  spread(x, y) %>% 
  names()
#> [1] "1"    "<NA>"

data_frame(x = c(1, NA), y = 1:2) %>% 
  spread(x, y, sep = "_") %>% 
  names()
#> [1] "x_1"  "x_NA"

Loading

@zross
Copy link

@zross zross commented May 23, 2016

A very useful addition!

Loading

@ghost
Copy link

@ghost ghost commented Feb 2, 2018

I think the naming of columns would be more intuitive if the default name when sep is specified was "<value_name><key_value>" to edit the above example:

jtxt <- "
name, grp,    sex, age
 joe,   1,   male,  30
 jan,   1,     NA,  43
 jen,   1, female,  44
 joe,   2,     NA,  24
 jan,   2, female,  63
 jen,   2, female,  50
"
jdat <- read.csv(text = jtxt, stringsAsFactors = FALSE, strip.white = TRUE)

If we spread this:

spread(jdat,grp,sex,sep = "_")
 name age  grp_1  grp_2
1  jan  43   <NA>   <NA>
2  jan  63   <NA> female
3  jen  44 female   <NA>
4  jen  50   <NA> female
5  joe  24   <NA>   <NA>
6  joe  30   male   <NA>

grp_1 is not a very descriptive title where as sex_1 would be more helpful.

It would also help make spread() more useful if you could spread more than one variable. In this example the age variable has been separated from group and it not longer makes sense. Alternatively, if there was an option ie keep_key=TRUE which would allow you to keep the grp column intact then you could spread the age column separately.

Loading

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
4 participants