NA handling in unite #203

Closed
voxnonecho opened this issue Jun 13, 2016 · 19 comments

Comments

@voxnonecho
commented Jun 13, 2016

Consider the following df:

ID   d1   d2   
1    G    G
2    A    G
3    A    A
4    G    A
5    NA   NA
6    G    G

When uniting d1 and d2:

tidyr::unite(df, new, d1, d2, remove = FALSE, sep = "")

Row 5 gives NANA instead of the expected NA

  ID  new   d1   d2
1  1   GG    G    G
2  2   AG    A    G
3  3   AA    A    A
4  4   GA    G    A
5  5 NANA <NA> <NA>
6  6   GG    G    G
@voxnonecho voxnonecho changed the title Strange way of handling NA in unite NA handling in unite Jun 13, 2016
@hadley

Member

commented Jun 13, 2016

unite() is just following the standard paste rules:

paste(NA, NA)
#> [1] "NA NA"
@voxnonecho

Author

commented Jun 13, 2016

I was thinking of a pre-processing treatment similar to: with(df, ifelse(is.na(d1) | is.na(d2), NA, paste0(d1, d2))).

@hadley

Member

commented Jun 13, 2016

I think you need a compelling argument as to why unite() should work differently to paste()

@voxnonecho

Author

commented Jun 13, 2016

Well, I think unite() should work like paste() but could maybe provide an additional argument to handle NAs, à la na.rm = TRUE

@danrlu

commented Sep 8, 2016

I think in some cases an option to omit NAs could be useful. My df has many columns that contain mostly NA, as a result of multiple rounds of joins.

recipe  potato  tomato  cucumber    rock
A       potato  NA      cucumber    NA
B       NA      NA      NA          rock
C       NA      tomato  NA          NA
...

So I was trying to combine the columns into one and remove the NA to see things better.

recipe  ingredients
A       potato,cucumber
B       rock
C       tomato
...

The solution is not hard, just not quite as tidy.
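
One possible shape for that "not quite as tidy" solution, sketched in base R. The data frame below retypes the recipe table from this comment; `recipes` and `ingredient_cols` are names made up for illustration, and this is not necessarily the approach that was actually used.

```r
# Hypothetical reconstruction of the recipe table shown above.
recipes <- data.frame(
  recipe   = c("A", "B", "C"),
  potato   = c("potato", NA, NA),
  tomato   = c(NA, NA, "tomato"),
  cucumber = c("cucumber", NA, NA),
  rock     = c(NA, "rock", NA),
  stringsAsFactors = FALSE
)

# Every column except the key holds either an ingredient or NA.
ingredient_cols <- setdiff(names(recipes), "recipe")

# For each row, keep the non-missing values and join them with commas.
recipes$ingredients <- apply(
  recipes[ingredient_cols], 1,
  function(row) paste(row[!is.na(row)], collapse = ",")
)

recipes[c("recipe", "ingredients")]
```

Because apply() goes through a character matrix, this only makes sense when all the united columns are already strings.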

@tjohnson250

commented Oct 26, 2016

I just ran into this issue and also suggest adding an option to handle NA in unite. In fact, I'd suggest that the following expressions (though perhaps with an extra param to omit NAs in unite) should produce output identical to their input:

df <- data.frame(x = c("a", "a b", "a b c", NA))
df
#>       x
#> 1     a
#> 2   a b
#> 3 a b c
#> 4  <NA>
df %>% separate(x, c("a", "b"), extra = "merge") %>% unite(x, a, b, sep = " ")
#>       x
#> 1  a NA
#> 2   a b
#> 3 a b c
#> 4 NA NA
#> Warning message:
#> Too few values at 1 locations: 1

In other words, if separate and unite are complements one should be able to use one of them to reverse the operation of the other.

@jennybc

Member

commented Oct 27, 2016

This is not the requested solution, but a clean way to get the desired result is:

library(tidyverse)
df <- tribble(
  ~ID, ~d1, ~d2,   
    1, "G", "G",
    2, "A", "G",
    3, "A", "A",
    4, "G", "A",
    5,  NA,  NA,
    6, "G", "G")
df %>% 
  replace_na(list(d1 = "", d2 = "")) %>% 
  unite(new, d1, d2, remove = FALSE, sep = "")
#> # A tibble: 6 × 4
#>      ID   new    d1    d2
#> * <dbl> <chr> <chr> <chr>
#> 1     1    GG     G     G
#> 2     2    AG     A     G
#> 3     3    AA     A     A
#> 4     4    GA     G     A
#> 5     5                  
#> 6     6    GG     G     G
@e-clin

commented Jun 20, 2017

I would also suggest adding an na.rm = TRUE option to handle NAs. Although @jennybc's alternative solution works for this particular problem, it leaves blanks and dangling separators when the separator is not "".

My problem is the same as @danrlu's. Is there a neater way to ignore the NAs? Currently I just unite all columns and then use str_replace_all to replace NAs and their adjacent separators with empty strings.
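
A rough sketch of that kind of post-hoc cleanup, using base gsub() in place of str_replace_all so that it has no dependencies. The `united` vector is made-up data standing in for the output of uniting with sep = ","; the pattern is an assumption about what the cleanup might look like, not the code actually used.

```r
# Hypothetical united strings, as paste()-style uniting with sep = "," would give.
united <- c(
  "potato,NA,cucumber,NA",
  "NA,NA,NA,rock",
  "NA,tomato,NA,NA",
  "NA,NA,NA,NA"
)

# Drop each whole "NA" token together with the separator in front of it ...
cleaned <- gsub("(^|,)NA(?=,|$)", "", united, perl = TRUE)
# ... then remove the separator left behind when the first token was NA.
cleaned <- sub("^,", "", cleaned)

cleaned
```

The obvious weakness is that a genuine string value "NA" is removed as well, which is part of why an argument on unite() itself is attractive.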

@alexpghayes

commented Jun 23, 2017

Here is a generalization of unite that allows you to create a new column from an arbitrary function applied to columns selected as you would with unite. The function needs to return a character vector because it relies on pmap_chr, but swap in pmap_* to taste.

library(tidyverse)

df <- tribble(
  ~ID, ~d1, ~d2, ~d3,
  1, "G", "G", "C",
  2, NA, "G", "T",
  3, "A", NA, "G",
  4, "G", "A", "A",
  5,  NA,  NA, NA,
  6, "G", "G", "G")

#' Semi-general \code{unite} to vectorize a function across columns of dataframe
#'
#' Accepts columns from a dataframe and vectorizes/parallel maps a function
#' across them, returning the result in a new column. Function must return a
#' character vector because \code{purrr::pmap_chr} enforces type-safety.
#'
#' @param df A dataframe
#' @param col Bare (unquoted) name of results column
#' @param ... Bare (unquoted) names of argument columns
#' @param fun A function that accepts as many arguments as provided argument
#' columns. Gets passed to \code{purrr::pmap_chr} so formula-style lambda
#' specification also works.
#' @param remove Whether or not to remove the argument columns (defaults
#' to \code{TRUE})
#'
#' @return Dataframe with new column generated by applying \code{fun} to
#' argument columns element-wise

combine <- function(df, col, ..., fun, remove = TRUE) {
  to_merge <- quos(...)
  new_col <- quo_name(enquo(col))
  merge_cols <- map_chr(to_merge, quo_name)

  df <- mutate(df, !!new_col := pmap_chr(list(!!!to_merge), fun))
  if (remove) df <- select(df, -one_of(merge_cols))
  df
}

combine(df, new, d1, d2, d3, fun = paste0)

#> # A tibble: 6 x 2
#>      ID    new
#>   <dbl>  <chr>
#> 1     1    GGC
#> 2     2   NAGT
#> 3     3   ANAG
#> 4     4    GAA
#> 5     5 NANANA
#> 6     6    GGG

I'm not a huge fan of this interface, but I have found it more useful than unite a couple of times. I've also found myself doing stupid things with this function and feel like it's some sort of anti-pattern.

If there are any changes coming to the unite interface, I'd enjoy something in the vein of facilitating vectorized operations for non-vectorized functions across dataframe columns.

@alistaire47

commented Sep 11, 2017

I'm not convinced that unite should work like paste, as it's a very rare situation when a user would actually want to turn NA values into strings. More concerningly, in terms of API consistency, separate will introduce NAs in a way that unite can't reverse:

library(tidyr)

example <- tibble::data_frame(x = c('foo', 'foo bar', 'foo bar baz'))

example %>% separate(x, c('foo', 'bar', 'baz'), fill = 'right')    # without `fill = 'right'` same result with a message 
#> # A tibble: 3 x 3
#>     foo   bar   baz
#> * <chr> <chr> <chr>
#> 1   foo  <NA>  <NA>
#> 2   foo   bar  <NA>
#> 3   foo   bar   baz

example %>% 
    separate(x, c('foo', 'bar', 'baz'), fill = 'right') %>% 
    unite(x, foo:baz, sep = ' ')
#> # A tibble: 3 x 1
#>             x
#> *       <chr>
#> 1   foo NA NA
#> 2  foo bar NA
#> 3 foo bar baz

If NAs are in the middle of columns that get united and then separated, paste-like behavior would allow the NA location to be saved (at the cost of requiring them to be converted from strings to actual NA again), but most of the time the NA handling keeps the functions from being inverses. Making na.rm = TRUE the default would be a breaking change, but probably not one that would break much code.

@hadley hadley added the strings 🎻 label Nov 16, 2017
@hadley

Member

commented Nov 16, 2017

There are actually two feature requests in this thread:

  1. Make NAs infectious so that if any input is NA, then the output is NA
  2. Provide an easy way to drop NAs.

Option 2 seems like the more useful option, so I will implement that.
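
To make the difference between the two requests concrete, here is a small base R sketch that applies both semantics to the four present/missing combinations. The names `cols`, `infectious` and `dropped` are purely illustrative, and this is not how unite() is implemented.

```r
# All four combinations of present/missing, as plain character vectors.
x <- c("a", "a", NA, NA)
y <- c("b", NA, "b", NA)
cols <- list(x, y)

# Option 1: NA is "infectious"; any missing input makes the result missing.
any_na <- Reduce(`|`, lapply(cols, is.na))
infectious <- ifelse(any_na, NA_character_, do.call(paste, c(cols, sep = "_")))

# Option 2: missing inputs are dropped before pasting.
dropped <- vapply(seq_along(x), function(i) {
  vals <- vapply(cols, function(col) col[i], character(1))
  paste(vals[!is.na(vals)], collapse = "_")
}, character(1))

data.frame(x, y, infectious, dropped)
```

Note that under option 2 a row where every input is missing ends up as an empty string rather than NA.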

@alexpghayes the plan is to extract out a general helper for turning the vectorised functions that power many tidyr functions into a tibblicious version

@hadley

Member

commented Nov 16, 2017

It'll be way easier (and faster) if this can be implemented at the stringi level, so I'm going to put this aside until gagolews/stringi#289 is resolved.

@moodymudskipper

commented Oct 10, 2018

gagolews/stringi#289 is closed :)

@hadley

Member

commented Mar 7, 2019

@moodymudskipper it's closed but not implemented in stri_paste().

@hadley

Member

commented Mar 7, 2019

Minimal reprex

library(tidyr)
df <- expand_grid(x = c("a", NA), y = c("b", NA))
unite(df, z, c("x", "y"), remove = FALSE)
#> # A tibble: 4 x 3
#>   z     x     y    
#>   <chr> <chr> <chr>
#> 1 a_b   a     b    
#> 2 a_NA  a     <NA> 
#> 3 NA_b  <NA>  b    
#> 4 NA_NA <NA>  <NA>

Created on 2019-03-07 by the reprex package (v0.2.1.9000)

@hadley

Member

commented Mar 7, 2019

Note that you'll need na.rm = TRUE (I left the default as is to preserve backward compatibility, since it seems like many people have probably worked around the previous behaviour in various ways).

library(tidyr)
df <- expand_grid(x = c("a", NA), y = c("b", NA))
df %>% unite("z", x:y, na.rm = TRUE, remove = FALSE)
#> # A tibble: 4 x 3
#>   z     x     y    
#>   <chr> <chr> <chr>
#> 1 a_b   a     b    
#> 2 a     a     <NA> 
#> 3 b     <NA>  b    
#> 4 ""    <NA>  <NA>

Created on 2019-03-07 by the reprex package (v0.2.1.9000)
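
In the output above, the row where both inputs are missing comes back as "" rather than NA. If a missing value is wanted there instead, one possible follow-up step is sketched below in base R; `z` is simply a hand-typed copy of the united column from that output, so this is an illustration rather than part of the change itself.

```r
# Hand-typed copy of the `z` column from the output above.
z <- c("a_b", "a", "b", "")

# Treat an empty result as missing, so all-NA rows stay NA.
# %in% is used instead of == so that existing NAs cannot break the subscript.
z[z %in% ""] <- NA_character_

z
```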

@hadley hadley closed this in 58df41d Mar 7, 2019
@kasperav

commented Mar 28, 2019

Hi @hadley ,

I am having trouble getting na.rm = TRUE to work within the unite() function.

I tried the following:

  1. Update R from 3.5.1 to 3.5.3
  2. Delete the old tidyverse and tidyr packages
  3. install fresh tidyverse package
  4. run the following code:
> library("tidyr")
> df <- expand.grid(x = c("a", NA), y = c("b", NA))
> df
     x    y
1    a    b
2 <NA>    b
3    a <NA>
4 <NA> <NA>
> df %>% unite("z", x:y, na.rm = TRUE, remove = FALSE)
Error: `TRUE` must evaluate to column positions or names, not a logical vector
Call `rlang::last_error()` to see a backtrace

Which gives me this error:

Error: `TRUE` must evaluate to column positions or names, not a logical vector
Call `rlang::last_error()` to see a backtrace

Backtracing error:

> rlang::last_error()
<error>
message: `TRUE` must evaluate to column positions or names, not a logical vector
class:   `rlang_error`
backtrace:
  1. tidyr::unite(., "z", x:y, na.rm = TRUE, remove = FALSE)
 10. tidyselect::vars_select(colnames(data), ...)
 11. tidyselect:::bad_calls(bad, "must evaluate to { singular(.vars) } positions or names, \\\n       not { first_type }")
 12. tidyselect:::glubort(fmt_calls(calls), ..., .envir = .envir)
 13. tidyr::unite(., "z", x:y, na.rm = TRUE, remove = FALSE)
Call `rlang::last_trace()` to see the full backtrace

> rlang::last_trace()
     x
  1. \-df %>% unite("z", x:y, na.rm = TRUE, remove = FALSE)
  2.   +-base::withVisible(eval(quote(`_fseq`(`_lhs`)), env, env))
  3.   \-base::eval(quote(`_fseq`(`_lhs`)), env, env)
  4.     \-base::eval(quote(`_fseq`(`_lhs`)), env, env)
  5.       \-global::`_fseq`(`_lhs`)
  6.         \-magrittr::freduce(value, `_function_list`)
  7.           +-base::withVisible(function_list[[k]](value))
  8.           \-function_list[[k]](value)
  9.             +-tidyr::unite(., "z", x:y, na.rm = TRUE, remove = FALSE)
 10.             \-tidyr:::unite.data.frame(., "z", x:y, na.rm = TRUE, remove = FALSE)
 11.               \-tidyselect::vars_select(colnames(data), ...)
 12.                 \-tidyselect:::bad_calls(bad, "must evaluate to { singular(.vars) } positions or names, \\\n       not { first_type }")
 13.                   \-tidyselect:::glubort(fmt_calls(calls), ..., .envir = .envir)
@hadley

Member

commented Mar 28, 2019

@kasperav you probably have not installed the development version of tidyr.
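
A quick way to check which situation you are in is to look at whether the installed unite() lists na.rm among its arguments. This is a sketch that assumes the argument appears in the signature of the exported generic; the variable `has_na_rm` is just an illustrative name.

```r
# Only inspect tidyr if it is installed; nothing is attached or modified.
if (requireNamespace("tidyr", quietly = TRUE)) {
  # Versions without the feature swallow `na.rm = TRUE` into `...`,
  # which is what produces the column-selection error shown above.
  has_na_rm <- "na.rm" %in% names(formals(tidyr::unite))
  message(
    "tidyr ", as.character(utils::packageVersion("tidyr")),
    if (has_na_rm) " has" else " lacks",
    " an na.rm argument in unite()"
  )
} else {
  has_na_rm <- NA
  message("tidyr is not installed")
}

# A development build can be installed from GitHub, for example with
# remotes::install_github("tidyverse/tidyr")  # not run here
```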

@kasperav

commented Mar 28, 2019

@hadley you are right! I had no luck installing the dev version, so I'll wait for this to be implemented in a CRAN version of tidyr :)
