crossing() removes NA from factor levels #410

Closed
echasnovski opened this Issue Feb 10, 2018 · 5 comments

Contributor

echasnovski commented Feb 10, 2018

library(tidyr)
packageVersion("tidyr")
#> [1] '0.8.0.9000'
x_fac <- factor(c(1, NA), exclude = NULL)
levels(x_fac)
#> [1] "1" NA
x_cross <- crossing(x_fac)
levels(x_cross$x_fac)
#> [1] "1"

The root of this seems to be in ulevels():

ulevels(factor(c(1, NA), exclude = NULL))
#> [1] 1    <NA>
#> Levels: 1

Adding exclude = NULL in factor() inside ulevels() solves the issue (and passes all current tests).

If this behavior is unintended, I am ready to make a PR.
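A minimal base-R sketch of the mechanism, assuming ulevels() rebuilds the factor from its own levels with a call along the lines of factor(levs, levels = levs) (an assumption about the internals, not a copy of tidyr's source). With the default exclude = NA, factor() drops NA from the levels even when NA is passed in explicitly:

```r
# Levels as they come out of a factor that has NA as a level.
levs <- c("1", NA)

# Default `exclude = NA`: NA is removed from the levels.
dropped <- factor(levs, levels = levs)
levels(dropped)  # "1"

# `exclude = NULL`: nothing is excluded, so NA stays a level.
kept <- factor(levs, levels = levs, exclude = NULL)
levels(kept)     # "1" NA
```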

Member

batpigandme commented Feb 10, 2018

I could be wrong here, but I was under the impression that NULL and NA are distinct in R and in the tidyverse (e.g. see the vectors section of R4DS). So it's possible that the problem comes from the fact that you're excluding NULL, as opposed to NA.

library(tidyr)
packageVersion("tidyr")
#> [1] '0.8.0'
x_fac <- factor(c(1, NA), exclude = NULL)
levels(x_fac)
#> [1] "1" NA
x_cross <- crossing(x_fac)
levels(x_cross$x_fac)
#> [1] "1"

x_fac <- factor(c(1, NA), exclude = NA)
levels(x_fac)
#> [1] "1"
x_cross <- crossing(x_fac)
levels(x_cross$x_fac)
#> [1] "1"

Created on 2018-02-10 by the reprex package (v0.2.0).

Contributor

echasnovski commented Feb 10, 2018

My goal is to create a factor which has NA in its levels. The way to do this is to ensure that NA is not excluded from the factor levels, which is done when creating x_fac by passing exclude = NULL to factor() instead of the default exclude = NA.

I think that crossing() should preserve the levels of its input, which it does not do when NA is one of the levels.
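To make the difference between the two exclude settings concrete, here is a small base-R sketch (the variable names are only illustrative). With the default, NA is a missing value and not a level; with exclude = NULL, NA becomes a level with its own integer code:

```r
x_default  <- factor(c(1, NA))                  # default exclude = NA
x_na_level <- factor(c(1, NA), exclude = NULL)  # NA kept as a level

nlevels(x_default)      # 1
nlevels(x_na_level)     # 2

# Underlying integer codes: NA is a missing code in the first factor,
# but an ordinary code pointing at the NA level in the second.
as.integer(x_default)   # 1 NA
as.integer(x_na_level)  # 1 2
```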

Contributor

echasnovski commented Feb 10, 2018

I found this by calling complete() on factor columns with NA in their levels:

library(tidyr)
packageVersion("tidyr")
#> [1] '0.8.0.9000'
df <- data.frame(
  x = factor(c(1, NA), exclude = NULL),
  y = factor(c(NA, 2), exclude = NULL)
)
str(df$x)
#>  Factor w/ 2 levels "1",NA: 1 2
str(df$y)
#>  Factor w/ 2 levels "2",NA: 2 1
df_completed <- complete(df, x, y)
#> Warning: Column `x` joining factors with different levels, coercing to
#> character vector
#> Warning: Column `y` joining factors with different levels, coercing to
#> character vector
str(df_completed$x)
#>  chr [1:4] "1" "1" NA NA
str(df_completed$y)
#>  chr [1:4] "2" NA "2" NA

After modifying ulevels(), this code runs without warnings and the output is as expected (factors with NA in their levels).

Contributor

echasnovski commented Feb 14, 2018

This might be a little more complicated. Modifying ulevels() with exclude = NULL changes the factor levels when there is an NA in the vector but not in the levels. It seems very important to always preserve factor levels in crossing() (and hence expand() and complete()), but they should also account for NAs present in the vector. So maybe this version of ulevels() is better?

ulevels <- function(x) {
  if (is.factor(x)) {
    orig_levs <- levels(x)
    x <- addNA(x, ifany = TRUE)
    levs <- levels(x)
    factor(levs, levels = orig_levs, ordered = is.ordered(x), exclude = NULL)
  } else {
    sort(unique(x), na.last = TRUE)
  }
}
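
The difference between the two fixes can be checked in base R alone. In the sketch below, ulevels_naive() is a hypothetical reconstruction of the simpler exclude = NULL fix (it is not taken from tidyr's source), and ulevels_preserving() repeats only the factor branch of the function above:

```r
# Hypothetical reconstruction of the simpler fix: rebuild from the
# NA-augmented levels and keep everything via `exclude = NULL`.
ulevels_naive <- function(x) {
  x <- addNA(x, ifany = TRUE)
  levs <- levels(x)
  factor(levs, levels = levs, ordered = is.ordered(x), exclude = NULL)
}

# Factor branch of the version proposed in this comment.
ulevels_preserving <- function(x) {
  orig_levs <- levels(x)
  x <- addNA(x, ifany = TRUE)
  levs <- levels(x)
  factor(levs, levels = orig_levs, ordered = is.ordered(x), exclude = NULL)
}

x <- factor(c(1, NA))  # NA in the values, not in the levels

levels(x)                      # "1"
levels(ulevels_naive(x))       # "1" NA    (an NA level was added)
levels(ulevels_preserving(x))  # "1"       (original levels kept)
is.na(ulevels_preserving(x))   # FALSE TRUE (NA is still a value)
```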

This version also passes all tests. With it, crossing() preserves factor levels, and complete() no longer converts factors to character vectors (a conversion that came from dplyr::left_join() when the levels differed):

# Code is RUN WITH MODIFIED version of tidyr
library(tidyr)

# `crossing()` preserves levels
x_na_lev <- factor(c(1, NA), exclude = NULL)
crossing(x_na_lev)$x_na_lev
#> [1] 1    <NA>
#> Levels: 1 <NA>
x_na_lev_extra <- factor(c(1, NA), levels = c(1, 2, NA), exclude = NULL)
crossing(x_na_lev_extra)$x_na_lev_extra
#> [1] 1    2    <NA>
#> Levels: 1 2 <NA>
x_no_na_lev <- factor(c(1, NA))
crossing(x_no_na_lev)$x_no_na_lev
#> [1] 1    <NA>
#> Levels: 1
x_no_na_lev_extra <- factor(c(1, NA), levels = c(1, 2))
crossing(x_no_na_lev_extra)$x_no_na_lev_extra
#> [1] 1    2    <NA>
#> Levels: 1 2

# `complete()` also preserves levels, with no warnings
df <- data.frame(x_na_lev, x_na_lev_extra, x_no_na_lev, x_no_na_lev_extra,
                 data = 10:11)
str(complete(df, x_na_lev, x_na_lev_extra, x_no_na_lev, x_no_na_lev_extra))
#> Classes 'tbl_df', 'tbl' and 'data.frame':    36 obs. of  5 variables:
#>  $ x_na_lev         : Factor w/ 2 levels "1",NA: 1 1 1 1 1 1 1 1 1 1 ...
#>  $ x_na_lev_extra   : Factor w/ 3 levels "1","2",NA: 1 1 1 1 1 1 2 2 2 2 ...
#>  $ x_no_na_lev      : Factor w/ 1 level "1": 1 1 1 NA NA NA 1 1 1 NA ...
#>  $ x_no_na_lev_extra: Factor w/ 2 levels "1","2": 1 2 NA 1 2 NA 1 2 NA 1 ...
#>  $ data             : int  10 NA NA NA NA NA NA NA NA NA ...

Member

hadley commented Feb 15, 2018

@batpigandme exclude = NULL is a correct (if weird) way of saying don't exclude NA values from factor levels
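
For readers who find exclude = NULL hard to read, base R's addNA() is a more explicit way to get the same levels. A short sketch (variable names are illustrative only):

```r
# Two ways to end up with NA as a factor level.
via_exclude <- factor(c(1, NA), exclude = NULL)
via_addNA   <- addNA(factor(c(1, NA)))

levels(via_exclude)  # "1" NA
levels(via_addNA)    # "1" NA

# With `ifany = TRUE`, addNA() only adds the level when the factor
# actually contains NA, so a complete factor is left untouched.
no_missing <- factor(c(1, 2))
levels(addNA(no_missing, ifany = TRUE))  # "1" "2"
```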
