Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix crossing() to preserve factor levels. #423

Merged
merged 4 commits into from Jan 4, 2019

Conversation

echasnovski
Copy link
Contributor

This fixes #410 . Couple of points:

  • Should I rename crossing() removes NA from factor levels #410 to "crossing() doesn't preserve factor levels" or this PR to "crossing() preserves NAs in factor levels"? I feel the first one is better.
  • This PR also fixes the issue with complete() coercing factor with NA in levels to character. Should I add the test for that?

@hadley
Copy link
Member

hadley commented Jan 4, 2019

Does this also fix #447?

x <- addNA(x, ifany = TRUE)
levs <- levels(x)
factor(levs, levels = levs, ordered = is.ordered(x))
factor(levs, levels = orig_levs, ordered = is.ordered(x), exclude = NULL)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you sure you need both orig_levs and exclude = NULL?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, they both are needed to ensure that NA stays in levels:

# Original levels are "a" and `NA`
x <- factor("a", levels = c("a", NA), exclude = NULL)
levels(x)
#> [1] "a" NA

orig_levs <- levels(x)
x <- addNA(x, ifany = TRUE)
levs <- levels(x)

# If not `exclude = NULL` then `NA` in levels is dropped
factor(levs, levels = orig_levs, ordered = is.ordered(x))
#> [1] a    <NA>
#> Levels: a
factor(levs, levels = orig_levs, ordered = is.ordered(x), exclude = NULL)
#> [1] a    <NA>
#> Levels: a <NA>

Created on 2019-01-04 by the reprex package (v0.2.1)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't factor(levs, levels = levs, ordered = is.ordered(x), exclude = NULL) sufficient?

(I have to say I'm surprised that exclude affects manually supplied levels!)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, because NA can be present in vector but not in levels. levs is computed after addNA (to account for present in vector NAs), and this might produce different levels than original:

# Original level is only "a" but `NA` is present in vector
x <- factor(c("a", NA), levels = "a")
levels(x)
#> [1] "a"

x <- addNA(x, ifany = TRUE)
levs <- levels(x)

# If use `levs` then `NA` is added to levels but it wasn't there
factor(levs, levels = levs, ordered = is.ordered(x), exclude = NULL)
#> [1] a    <NA>
#> Levels: a <NA>

Created on 2019-01-04 by the reprex package (v0.2.1)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that's the intent? Otherwise why do addNA() at all?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't quite understand your questions.

I supposed that the logic of ulevels() is to create a vector with all unique values (which will then be used in at least spread(), crossing(), expand(), and complete()). The "uniqueness" is defined differently for factors and non-factors. For factors output should be "all levels()" + "NA if it is present in vector" (that is why addNA() is used).

Also for some reason it was decided to be also factor, and that's where the problem for this PR begins: if NA is present in original levels, it is dropped with current implementation. This PR seems to suggest the least "interfering" way to fix this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having changed factor levels introduces much bigger problems with complete():

library(tidyr)
packageVersion("tidyr")
#> [1] '0.8.2'

df <- data.frame(
  x = factor(c(1, NA), exclude = NULL),
  y = factor(c(NA, 2), exclude = NULL)
)
str(df$x)
#>  Factor w/ 2 levels "1",NA: 1 2
str(df$y)
#>  Factor w/ 2 levels "2",NA: 2 1

df_completed <- complete(df, x, y)
#> Warning: Column `x` joining factors with different levels, coercing to
#> character vector
#> Warning: Column `y` joining factors with different levels, coercing to
#> character vector

str(df_completed$x)
#>  chr [1:4] "1" "1" NA NA
str(df_completed$y)
#>  chr [1:4] "2" NA "2" NA

Created on 2019-01-04 by the reprex package (v0.2.1)

After modifying ulevels() this code has no warnings and output is as expected (factors with NA in levels).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am trying to understand the existing logic of this function — it has an old heritage (it's originally from plyr) and it's probably not well tested. I want to make sure that we're making the smallest possible change to achieve the desired outcome because that minimise the chance of unanticipated consequences.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, OK, got it.

This change will affect only the code where NA was intentionally brought into levels (before crossing(), etc.). This seems to be both rare and associated with undesired behaviour under current implementation.

The nature of this problem is similar to #165: output "unique" factor doesn't have the same metadata as input one.

@@ -52,6 +52,20 @@ test_that("preserves NAs", {
expect_equal(nesting(x)$x, x)
})

test_that("crossing preserves factor levels", {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you should have one test for crossing(), but since the change is actually in ulevel() would you mind focussing most of your testing energy there?

@echasnovski
Copy link
Contributor Author

Does this also fix #447?

No, it's not. nesting() uses dplyr::distinct() and not ulevels().

@hadley hadley merged commit b3289dd into tidyverse:master Jan 4, 2019
@hadley
Copy link
Member

hadley commented Jan 4, 2019

Thanks for the code and all the explanation — it's much appreciated!

@echasnovski echasnovski deleted the crossing-fac-levs branch January 5, 2019 07:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

crossing() removes NA from factor levels
2 participants