New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix crossing()
to preserve factor levels.
#423
Conversation
Does this also fix #447? |
x <- addNA(x, ifany = TRUE) | ||
levs <- levels(x) | ||
factor(levs, levels = levs, ordered = is.ordered(x)) | ||
factor(levs, levels = orig_levs, ordered = is.ordered(x), exclude = NULL) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you sure you need both orig_levs
and exclude = NULL
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, they both are needed to ensure that NA
stays in levels:
# Original levels are "a" and `NA`
x <- factor("a", levels = c("a", NA), exclude = NULL)
levels(x)
#> [1] "a" NA
orig_levs <- levels(x)
x <- addNA(x, ifany = TRUE)
levs <- levels(x)
# If not `exclude = NULL` then `NA` in levels is dropped
factor(levs, levels = orig_levs, ordered = is.ordered(x))
#> [1] a <NA>
#> Levels: a
factor(levs, levels = orig_levs, ordered = is.ordered(x), exclude = NULL)
#> [1] a <NA>
#> Levels: a <NA>
Created on 2019-01-04 by the reprex package (v0.2.1)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't factor(levs, levels = levs, ordered = is.ordered(x), exclude = NULL)
sufficient?
(I have to say I'm surprised that exclude
affects manually supplied levels!)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, because NA
can be present in vector but not in levels. levs
is computed after addNA
(to account for present in vector NA
s), and this might produce different levels than original:
# Original level is only "a" but `NA` is present in vector
x <- factor(c("a", NA), levels = "a")
levels(x)
#> [1] "a"
x <- addNA(x, ifany = TRUE)
levs <- levels(x)
# If use `levs` then `NA` is added to levels but it wasn't there
factor(levs, levels = levs, ordered = is.ordered(x), exclude = NULL)
#> [1] a <NA>
#> Levels: a <NA>
Created on 2019-01-04 by the reprex package (v0.2.1)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that's the intent? Otherwise why do addNA()
at all?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't quite understand your questions.
I supposed that the logic of ulevels()
is to create a vector with all unique values (which will then be used in at least spread()
, crossing()
, expand()
, and complete()
). The "uniqueness" is defined differently for factors and non-factors. For factors output should be "all levels()
" + "NA
if it is present in vector" (that is why addNA()
is used).
Also for some reason it was decided to be also factor, and that's where the problem for this PR begins: if NA
is present in original levels, it is dropped with current implementation. This PR seems to suggest the least "interfering" way to fix this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Having changed factor levels introduces much bigger problems with complete()
:
library(tidyr)
packageVersion("tidyr")
#> [1] '0.8.2'
df <- data.frame(
x = factor(c(1, NA), exclude = NULL),
y = factor(c(NA, 2), exclude = NULL)
)
str(df$x)
#> Factor w/ 2 levels "1",NA: 1 2
str(df$y)
#> Factor w/ 2 levels "2",NA: 2 1
df_completed <- complete(df, x, y)
#> Warning: Column `x` joining factors with different levels, coercing to
#> character vector
#> Warning: Column `y` joining factors with different levels, coercing to
#> character vector
str(df_completed$x)
#> chr [1:4] "1" "1" NA NA
str(df_completed$y)
#> chr [1:4] "2" NA "2" NA
Created on 2019-01-04 by the reprex package (v0.2.1)
After modifying ulevels()
this code has no warnings and output is as expected (factors with NA
in levels).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am trying to understand the existing logic of this function — it has an old heritage (it's originally from plyr) and it's probably not well tested. I want to make sure that we're making the smallest possible change to achieve the desired outcome because that minimise the chance of unanticipated consequences.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, OK, got it.
This change will affect only the code where NA
was intentionally brought into levels (before crossing()
, etc.). This seems to be both rare and associated with undesired behaviour under current implementation.
The nature of this problem is similar to #165: output "unique" factor doesn't have the same metadata as input one.
@@ -52,6 +52,20 @@ test_that("preserves NAs", { | |||
expect_equal(nesting(x)$x, x) | |||
}) | |||
|
|||
test_that("crossing preserves factor levels", { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you should have one test for crossing()
, but since the change is actually in ulevel()
would you mind focussing most of your testing energy there?
No, it's not. |
Thanks for the code and all the explanation — it's much appreciated! |
This fixes #410 . Couple of points:
complete()
coercing factor withNA
in levels to character. Should I add the test for that?