NA handling in unite #203

Closed
voxnonecho opened this issue Jun 13, 2016 · 30 comments

Comments

@voxnonecho voxnonecho commented Jun 13, 2016

Consider the following df:

ID   d1   d2   
1    G    G
2    A    G
3    A    A
4    G    A
5    NA   NA
6    G    G

When uniting d1 and d2:

tidyr::unite(df, new, d1, d2, remove = FALSE, sep = "")

Row 5 gives NANA instead of the expected NA

  ID  new   d1   d2
1  1   GG    G    G
2  2   AG    A    G
3  3   AA    A    A
4  4   GA    G    A
5  5 NANA <NA> <NA>
6  6   GG    G    G
@voxnonecho voxnonecho changed the title Strange way of handling NA in unite NA handling in unite Jun 13, 2016

@voxnonecho voxnonecho commented Jun 13, 2016

Well, I think unite() should work like paste() but could maybe provide an additional argument to handle NAs, à la na.rm = TRUE

@danrlu danrlu commented Sep 8, 2016

I think in some cases the omit NA option could be useful. My df has many columns that contain mostly NA, as a result of multiple rounds of joins.

recipe  potato  tomato  cucumber    rock
A       potato  NA      cucumber    NA
B       NA      NA      NA          rock
C       NA      tomato  NA          NA
...

So I was trying to combine the columns into one and remove the NA to see things better.

recipe  ingredients
A       potato,cucumber
B       rock
C       tomato
...

The solution is not hard, just not quite as tidy.
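
A minimal sketch of one such workaround, using base R only. The data frame below is a hypothetical reconstruction of the table above, and the `ingredient_cols` name is introduced just for illustration. It is one of several possible approaches, not necessarily the one the commenter used.

```r
# Hypothetical data mirroring the recipe table above.
df <- data.frame(
  recipe   = c("A", "B", "C"),
  potato   = c("potato", NA, NA),
  tomato   = c(NA, NA, "tomato"),
  cucumber = c("cucumber", NA, NA),
  rock     = c(NA, "rock", NA),
  stringsAsFactors = FALSE
)

# Every column except the key is treated as an ingredient column.
ingredient_cols <- setdiff(names(df), "recipe")

# For each row, keep the non-missing values and join them with commas.
df$ingredients <- apply(
  df[ingredient_cols], 1,
  function(row) paste(row[!is.na(row)], collapse = ",")
)

df[c("recipe", "ingredients")]
#> ingredients: "potato,cucumber" "rock" "tomato"
```

Note that a row whose ingredient columns are all NA would yield an empty string here, not NA.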

@jennybc jennybc commented Oct 27, 2016

This is not the requested solution, but a clean way to get the desired result is:

library(tidyverse)
df <- tribble(
  ~ID, ~d1, ~d2,   
    1, "G", "G",
    2, "A", "G",
    3, "A", "A",
    4, "G", "A",
    5,  NA,  NA,
    6, "G", "G")
df %>% 
  replace_na(list(d1 = "", d2 = "")) %>% 
  unite(new, d1, d2, remove = FALSE, sep = "")
#> # A tibble: 6 × 4
#>      ID   new    d1    d2
#> * <dbl> <chr> <chr> <chr>
#> 1     1    GG     G     G
#> 2     2    AG     A     G
#> 3     3    AA     A     A
#> 4     4    GA     G     A
#> 5     5                  
#> 6     6    GG     G     G

@alistaire47 alistaire47 commented Sep 11, 2017

I'm not convinced that unite should work like paste, as it's a very rare situation when a user would actually want to turn NA values into strings. More concerningly, in terms of API consistency separate will introduce NAs in a way that unite can't reverse:

library(tidyr)

example <- tibble::data_frame(x = c('foo', 'foo bar', 'foo bar baz'))

example %>% separate(x, c('foo', 'bar', 'baz'), fill = 'right')    # without `fill = 'right'` same result with a message 
#> # A tibble: 3 x 3
#>     foo   bar   baz
#> * <chr> <chr> <chr>
#> 1   foo  <NA>  <NA>
#> 2   foo   bar  <NA>
#> 3   foo   bar   baz

example %>% 
    separate(x, c('foo', 'bar', 'baz'), fill = 'right') %>% 
    unite(x, foo:baz, sep = ' ')
#> # A tibble: 3 x 1
#>             x
#> *       <chr>
#> 1   foo NA NA
#> 2  foo bar NA
#> 3 foo bar baz

If NAs are in the middle of columns that get united and then separated then paste-like behavior would allow the NA location to be saved (at the cost of requiring them to be converted from strings to actual NA again), but most of the time the NA handling keeps the functions from being inverses. Making na.rm = TRUE the default would be a breaking change, but probably not one that would break much code.

@hadley hadley commented Nov 16, 2017

There are actually two feature requests in this thread:

  1. Make NAs infectious so that if any input is NA, then the output is NA
  2. Provide an easy way to drop NAs.

2. seems like the more useful option so I will implement that.

@alexpghayes the plan is to extract out a general helper for turning the vectorised functions that power many tidyr functions into a tibblicious version.
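
To make the difference between the two requests concrete, here is a small base R sketch on plain character vectors. It only illustrates the two semantics; it is not how tidyr implements either of them, and the variable names `infectious` and `dropped` are made up for this example.

```r
x <- c("a", "a", NA, NA)
y <- c("b", NA, "b", NA)

# paste()-like behaviour: NA is converted to the string "NA".
paste(x, y, sep = "_")
#> "a_b" "a_NA" "NA_b" "NA_NA"

# Request 1, "infectious" NA: any missing input makes the output missing.
infectious <- ifelse(is.na(x) | is.na(y), NA_character_, paste(x, y, sep = "_"))
#> "a_b" NA NA NA

# Request 2, drop NAs: join only the non-missing pieces of each pair.
dropped <- mapply(
  function(...) {
    pieces <- c(...)
    paste(pieces[!is.na(pieces)], collapse = "_")
  },
  x, y,
  USE.NAMES = FALSE
)
#> "a_b" "a" "b" ""
```

Under the second semantics, a row with no non-missing inputs collapses to an empty string.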

@hadley hadley commented Mar 7, 2019

Minimal reprex

library(tidyr)
df <- expand_grid(x = c("a", NA), y = c("b", NA))
unite(df, z, c("x", "y"), remove = FALSE)
#> # A tibble: 4 x 3
#>   z     x     y    
#>   <chr> <chr> <chr>
#> 1 a_b   a     b    
#> 2 a_NA  a     <NA> 
#> 3 NA_b  <NA>  b    
#> 4 NA_NA <NA>  <NA>

Created on 2019-03-07 by the reprex package (v0.2.1.9000)

@hadley hadley commented Mar 7, 2019

Note that you'll need na.rm = TRUE (I left the default as is to preserve backward compatibility, since it seems like many people have probably worked around the previous behaviour in various ways).

library(tidyr)
df <- expand_grid(x = c("a", NA), y = c("b", NA))
df %>% unite("z", x:y, na.rm = TRUE, remove = FALSE)
#> # A tibble: 4 x 3
#>   z     x     y    
#>   <chr> <chr> <chr>
#> 1 a_b   a     b    
#> 2 a     a     <NA> 
#> 3 b     <NA>  b    
#> 4 ""    <NA>  <NA>

Created on 2019-03-07 by the reprex package (v0.2.1.9000)

@hadley hadley closed this in 58df41d Mar 7, 2019

@kasperav kasperav commented Mar 28, 2019

Hi @hadley ,

I am having trouble getting na.rm = TRUE to work within the unite() function.

I tried the following:

  1. Update R from 3.5.1 to 3.5.3
  2. Delete the old tidyverse and tidyr packages
  3. Install a fresh tidyverse package
  4. Run the following code:
> library("tidyr")
> df <- expand.grid(x = c("a", NA), y = c("b", NA))
> df
     x    y
1    a    b
2 <NA>    b
3    a <NA>
4 <NA> <NA>
> df %>% unite("z", x:y, na.rm = TRUE, remove = FALSE)
Error: `TRUE` must evaluate to column positions or names, not a logical vector
Call `rlang::last_error()` to see a backtrace

Which gives me this error:

Error: `TRUE` must evaluate to column positions or names, not a logical vector
Call `rlang::last_error()` to see a backtrace

Backtracing error:

> rlang::last_error()
<error>
message: `TRUE` must evaluate to column positions or names, not a logical vector
class:   `rlang_error`
backtrace:
  1. tidyr::unite(., "z", x:y, na.rm = TRUE, remove = FALSE)
 10. tidyselect::vars_select(colnames(data), ...)
 11. tidyselect:::bad_calls(bad, "must evaluate to { singular(.vars) } positions or names, \\\n       not { first_type }")
 12. tidyselect:::glubort(fmt_calls(calls), ..., .envir = .envir)
 13. tidyr::unite(., "z", x:y, na.rm = TRUE, remove = FALSE)
Call `rlang::last_trace()` to see the full backtrace

> rlang::last_trace()
     x
  1. \-df %>% unite("z", x:y, na.rm = TRUE, remove = FALSE)
  2.   +-base::withVisible(eval(quote(`_fseq`(`_lhs`)), env, env))
  3.   \-base::eval(quote(`_fseq`(`_lhs`)), env, env)
  4.     \-base::eval(quote(`_fseq`(`_lhs`)), env, env)
  5.       \-global::`_fseq`(`_lhs`)
  6.         \-magrittr::freduce(value, `_function_list`)
  7.           +-base::withVisible(function_list[[k]](value))
  8.           \-function_list[[k]](value)
  9.             +-tidyr::unite(., "z", x:y, na.rm = TRUE, remove = FALSE)
 10.             \-tidyr:::unite.data.frame(., "z", x:y, na.rm = TRUE, remove = FALSE)
 11.               \-tidyselect::vars_select(colnames(data), ...)
 12.                 \-tidyselect:::bad_calls(bad, "must evaluate to { singular(.vars) } positions or names, \\\n       not { first_type }")
 13.                   \-tidyselect:::glubort(fmt_calls(calls), ..., .envir = .envir)

@hadley hadley commented Mar 28, 2019

@kasperav you probably have not installed the development version of tidyr.

@kasperav kasperav commented Mar 28, 2019

@hadley you are right! I have no luck with installing the dev version, so I'll wait for this to be implemented in a CRAN version of tidyr :)

@jameshowison jameshowison commented Dec 27, 2019

FWIW, I found the behavior where unite takes two NA values and produces an empty string to be very confusing and unexpected. Seems clear to me that uniting two NA values should produce an NA value.

I'm guessing this is clearer to people who have used paste a lot :) It is simple to fix up with na_if("") (but one has to hope that the empty string wasn't a meaningful value distinct from NA_character_ in the original columns!)
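
A sketch of that fix-up, assuming a tidyr version that supports `na.rm` in `unite()` (the feature added in this issue) and dplyr for `na_if()` and `mutate()`. The data frame is the reprex used earlier in the thread; the name `out` is introduced here for illustration.

```r
library(tidyr)
library(dplyr)

df <- expand_grid(x = c("a", NA), y = c("b", NA))

out <- df %>%
  unite("z", x:y, na.rm = TRUE, remove = FALSE) %>%
  # Turn the empty string produced by an all-NA row back into NA.
  mutate(z = na_if(z, ""))

out$z
#> "a_b" "a" "b" NA
```

As the caveat above says, this also converts any genuine empty strings in the united column to NA, so it is only safe when "" carries no meaning of its own.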

@lindsayplatt lindsayplatt commented Feb 25, 2020

I have a use case where I need to use na.rm = TRUE and unite for 8 columns. One of the columns is all NA. Using na.rm = T with unite seems to have different behavior when one of the columns is all NA. Is that expected behavior? Should I just ignore columns that are all NA before using unite?

library(tidyr)
df_notwork <- expand_grid(x = c("a", NA), y = c(NA, NA))
df_notwork %>% unite("z", x:y, na.rm = TRUE, remove = FALSE)

# A tibble: 4 x 3
  z     x     y    
  <chr> <chr> <lgl>
1 a_NA  a     NA   
2 a_NA  a     NA   
3 NA    NA    NA   
4 NA    NA    NA 

@jzadra jzadra commented Feb 25, 2020

What version are you using? That's not the result I get (on 1.0.2.9000)

suppressPackageStartupMessages(require(tidyverse))
df_notwork <- expand_grid(x = c("a", NA), y = c(NA, NA))
df_notwork %>% unite("z", x:y, na.rm = T, remove = FALSE)
#> # A tibble: 4 x 3
#>   z     x     y    
#>   <chr> <chr> <lgl>
#> 1 "a"   a     NA   
#> 2 "a"   a     NA   
#> 3 ""    <NA>  NA   
#> 4 ""    <NA>  NA

Created on 2020-02-25 by the reprex package (v0.3.0)

@lindsayplatt lindsayplatt commented Feb 25, 2020

I am using a newer version.

packageVersion("tidyverse")
[1] ‘1.3.0’

@jzadra jzadra commented Feb 25, 2020

tidyverse is different from tidyr; it is a collection of other packages put together for easy loading. So it will have a different version than all the packages within it. Check your tidyr version.

@lindsayplatt lindsayplatt commented Feb 25, 2020

Oh, sorry I saw that you were loading tidyverse so I assumed that was the version you were referring to. I always assumed that updating tidyverse would update the packages within it so I normally just update that one. I guess that is an inappropriate assumption!

Even with updating tidyr using the GitHub version, I still have that issue. Maybe it is another out-of-date package?

packageVersion("tidyr")
[1] ‘1.0.2.9000’
> library(tidyr)
> df_notwork <- expand_grid(x = c("a", NA), y = c(NA, NA))
> df_notwork %>% unite("z", x:y, na.rm = TRUE, remove = FALSE)
# A tibble: 4 x 3
  z     x     y    
  <chr> <chr> <lgl>
1 a_NA  a     NA   
2 a_NA  a     NA   
3 NA    NA    NA   
4 NA    NA    NA  

@jzadra jzadra commented Feb 25, 2020

Interesting. I'm not sure why we are getting different results.

Regardless, it looks to me as if your NAs aren't being removed despite na.rm = TRUE.

Yes, I would try updating your other packages and see if that solves it. But since both expand_grid and unite are from tidyr I'm not sure why that would be the case.

@lindsayplatt lindsayplatt commented Feb 25, 2020

It appears that my version of tidyselect was quite out-of-date (<1.0). I updated that and now it is functioning as expected.

packageVersion("tidyr")
[1] ‘1.0.2.9000’

packageVersion("tidyselect")
[1] ‘1.0.0’

library(tidyr)
df_notwork <- expand_grid(x = c("a", NA), y = c(NA, NA))
df_notwork %>% unite("z", x:y, na.rm = TRUE, remove = FALSE)

# A tibble: 4 x 3
  z     x     y    
  <chr> <chr> <lgl>
1 "a"   a     NA   
2 "a"   a     NA   
3 ""    NA    NA   
4 ""    NA    NA 

@anjaollodart anjaollodart commented Apr 1, 2020

Hello,

I've updated to all the latest versions of the packages (tidyr 1.0.2.900, tidyselect 1.0.0) and I'm still getting the same error. I tried Lindsay's df_notwork and get the same result she had prior to the updates. Any help would be appreciated!

@lindsayplatt lindsayplatt commented Apr 2, 2020

@anjaollodart - perhaps you can try updating additional packages that tidyr depends on. It's just a guess, but the need to separately update tidyselect from tidyr was surprising to me, so maybe there is another package dependency that has the same issue.
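
One way to see which dependencies might be stale is to list the installed versions of everything tidyr depends on. This is a rough sketch using base R tooling only; the names `pkg`, `deps`, `present` and `versions` are introduced here for illustration, and it reads the locally installed package metadata, not CRAN.

```r
pkg <- "tidyr"

# Direct dependencies of the package, read from the local library.
deps <- tools::package_dependencies(
  pkg,
  db = installed.packages(),
  which = c("Depends", "Imports")
)[[pkg]]

# Keep only dependencies that are actually installed, then look up versions.
present <- deps[deps %in% rownames(installed.packages())]
versions <- vapply(
  present,
  function(p) as.character(packageVersion(p)),
  character(1)
)

versions

# To update just these packages (not run here, needs network access):
# update.packages(oldPkgs = names(versions), ask = FALSE)
```

If tidyr itself is not installed, `deps` is empty and `versions` comes back as a zero-length character vector.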

@jvpon jvpon commented Apr 2, 2020
