-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix #2203 (combine with list of characters with NA) #2209
Conversation
I wonder if it would be better to be stricter and only check for logicals of length 1? Do we need to apply the same principle for other atomic vectors, or are they already covered? Either way, it's probably worth pulling |
Logicals with all dplyr::combine(list("a", "b", "c", c(NA, NA), "e")) Should work like: dplyr::combine(list("a", "b", "c", c(NA_character_, NA_character_), "e")) Returning c("a", "b", "c", NA, NA, "e") I will write tests for the other collecters to check (a) if this |
> combine(list(1:3, NA))
[1] 1 2 3 NA
> combine(list(1.5:3, NA))
[1] 1.5 2.5 NA
> combine(list(complex(real = 1.5:3), NA))
Error in eval(substitute(expr), envir, enclos) :
Can not automatically convert from complex to logical.
> combine(list(factor(1.5:3), NA))
Error in eval(substitute(expr), envir, enclos) :
Can not automatically convert from factor to logical. |
Test and fix if necessary that combine handles missing values with:
|
My main concern with checking the value for vectors > length 1 is performance since you're now scanning the value of every element. But I guess it can't actually cause a performance regression, since currently the code throws an error. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome work, thanks. @hadley: Okay to merge after 2x splitting collect()?
STORAGE* source_ptr = Rcpp::internal::r_vector_start<RTYPE>(source); | ||
for (int i=0; i<index.size(); i++) { | ||
data[index[i]] = source_ptr[i]; | ||
if (all_logical_na(v)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the way Collecter::collect() is split into functions with clear names.
int* source_ptr = Rcpp::internal::r_vector_start<INTSXP>(source); | ||
for (int i=0; i<index.size(); i++) { | ||
if (source_ptr[i] == NA_INTEGER) { | ||
if (Rf_inherits(v, "factor") && has_same_levels_as(v)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same.
This still requires further work, some cases where there is an initial NA fail: combine(list(NA, "hello"))
# Error cannot convert from logical to character Similar errors happen with factors and complex. Solutions (I need to work on them):
|
@@ -9,6 +9,10 @@ | |||
|
|||
namespace dplyr { | |||
|
|||
static inline bool all_logical_na(SEXP x) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we somehow reuse is_logical_all_na() below?
I think the collecter should be strict and only mix logicals with other types (including numerics) if only NAs were seen. I remember fixing very similar issues in DelayedProcessor.h. |
Thanks for all your feedback, comments and suggestions |
@hadley: What should |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great. I made some comments on the tests in order to make them a bit more compact.
|
||
test_that("combine works with NA and Date (#2203)", { | ||
# NA first | ||
expected_result <- as.Date(c(NA, "2010-01-01", "2010-01-02", NA, "2010-01-04")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this would be a little easier to read if you initialised as as.Date("2010-01-01") + c(NA, 0, 1, NA, 3)
etc
expected_result <- as.Date(c(NA, "2010-01-01", "2010-01-02", NA, "2010-01-04")) | ||
works1 <- combine(list(NA, as.Date("2010-01-01"), as.Date("2010-01-02"), | ||
as.Date(NA), as.Date("2010-01-04"))) | ||
expect_equal(works1, expected_result) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this could just be expect_equal(combine(as.list(works1)), works1)
|
||
# NA length > 1 | ||
expected_result <- c(as.Date(c("2010-01-01", "2010-01-02", NA, NA, "2010-01-04"))) | ||
works1 <- combine(list(as.Date("2010-01-01"), as.Date("2010-01-02"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here you could maybe make the principle easier to understand by doing split(works1, c(1, 2, 3, 3, 4))
or similar.
expect_equal(works1, expected_result) | ||
|
||
# NA length == 1 | ||
expected_result <- complex(real = c(1, 2, NA, 4), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
c(1, 2, NA, 4) + 1i
would be more compact
@krlmlr For now I'd say it should be an error because we want to be strict about coercing types (integer -> double is one exception). We will need to write up the exact principles in vctrs - currently we have somewhat behaviour spread across multiple functions (i.e. r-lib/vctrs#7). |
# (@test-binds.R#123)
test_that("bind_rows promotes logical to integer", {
df1 <- data_frame(a = FALSE)
df2 <- data_frame(a = 1L)
res <- bind_rows(df1, df2)
expect_equal(res$a, c(0L, 1L))
})
So the question now is, do we remove the coercion from logical to integer in In case you decide to remove the coercion I will replace the expect_equal in the test above to an expect_error. Before merging, the NEWS entry will need to be updated. Probably tomorrow. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. The logical -> integer test originates in 51ddd60, I think we can change it.
# NA first | ||
expected_result <- factor(c(NA, "a", "c", NA, "b"), levels = c("a","b","c")) | ||
works1 <- combine(list(NA, | ||
factor("a", levels = c("a","b","c")), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can be simplified e.g. by defining x <- factor(letters[1:3])
and then using x[[1]]
etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done eaa9b0d
# Date-NA is Date (unlist would coerce to numeric) | ||
# logicalNA_POSIXct: We add tzone = "" | ||
# POSIXct_POSIXct (unlist would coerce to numeric) | ||
pairs_result <- list("factor_character" = c("a", "b"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about defining another column in the data frame (of type list)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done 698e750
expect_equal(works3, expected_result) | ||
}) | ||
|
||
test_that("combine is strict combining/promoting types", { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a very thorough test, I appreciate that. Can we this into a meta-code that generates explicit but simple expect_equal() or expect_error() statements?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I do not know how to do that and I could not find any example using testthat to guide me.
However, when I run the test-combine.R
script the expect_that
functions that fail print the info
argument that contains Pair: (item1, item2)
. This is printed when running R CMD check
:
1. Failure: combine is strict combining/promoting types (@test-combine.R#214) --
combine(items[c(pairs$Var1[i], pairs$Var2[i])]) did not throw an error.
Pair: ( integer , character )
So by reading this I know that coercing an integer and a character did not throw an error as expected (this is a dummy example, it actually gives an error)
I imagine that you would rather have combine(items[c("integer", "character")]) did not throw an error.
as an error message, but I do not know how to do that substitution. 😅
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know how to do that now. Fix is on the way
@@ -177,11 +210,13 @@ namespace dplyr { | |||
|
|||
inline bool compatible(SEXP x) { | |||
int RTYPE = TYPEOF(x); | |||
return (INTSXP == RTYPE || RTYPE == LGLSXP) && !Rf_inherits(x, "factor"); | |||
return (INTSXP == RTYPE && !Rf_inherits(x, "factor") && | |||
!Rf_inherits(x, "POSIXct") && !Rf_inherits(x, "Date")) || |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This isn't completely failsafe against other classes such as "difftime"
. I wonder if the additional checks here are worth the effort, or if we should just let it slide and wait for r-lib/vctrs#7 (and potentially r-lib/vctrs#27). @hadley?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should probably treat any vector with a class as incompatible by default, but I agree we don't need to do that in this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did not implement support for "difftime", but now it treats any "integer with a class" or "real with a class" as incompatible (0b9ed6a) so if someone tries to collect an unsupported class (such as "difftime" or "dummyint" -in the test case-) the collecter returns an error. If you are not sure about accepting this behaviour I can revert it and difftime will be coerced to integer again (with potential loss of information).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe raising a warning and then coercing the class is a better solution for now (breaks less code and allows users to adapt).
I could also allow coercion from logical to integer with a warning, so the breaking change is softer. The decision is up to you. :-)
@@ -63,6 +63,8 @@ | |||
|
|||
* `mutate_all()` etc now accept unnamed additional arguments. | |||
|
|||
* `combine()` accepts `NA` values (#2203, @zeehio) | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Breaking changes should be documented clearly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done in 9fb42da
return new Collecter_Impl<INTSXP>(n); | ||
else { | ||
SEXP classes = Rf_getAttrib(model, R_ClassSymbol); | ||
stop("Can't collect elements of class %s", CHAR(STRING_ELT(classes, 0))); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should make this a warning for now
be40e25
to
918dac4
Compare
This is the summary of all the 11x11 = 121 possible coercions I have tested. I have considered all pairs of the following types:
The behaviour implemented in this PR allows to combine 35 out of 121 possible pairs. Allow combine with NACustom classes may lose attributes or have wrong attributes if a NA is present, therefore they give a warning.
Allow combine with the same typeCustom classes may lose attributes and therefore give a warning.
Allowed coercions
Not allowed combinations
I have rebased and squashed the commits into 4 manageable commits. I will appreciate your comments. Oh, and in case a regression appeared in the future, the error message would explicit (here is an example of what would be printed):
|
This has been left here ready for review, but no hurries (end of year is usually busy, so happy new year to all of you) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for all your work on "this small pull request" ;-) This will be very useful for vctrs, too, we can probably use most of the tests there right away.
Would you mind doing another small round?
label = paste0("combine(items[c(\"", var1, "\", \"", var2, "\")])"), | ||
expected.label = deparse(result)) | ||
} else { | ||
expect_error(suppressWarnings(combine(item_pair)), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd prefer expect_error(expect_warning(...))
, does that work?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
expect_error(expect_warning(...))
will not work as expected, because if a bug causes ...
to run without warnings, then expect_warning
will raise an error that will be accepted and suppressed by expect_error
and the test will succeed where it should have failed.
I am working on all the other comments
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about expect_warning(expect_error(...))
?
library(testthat)
with_reporter("summary",
test_that("error + warning", {
expect_warning(expect_error({warning(); stop()}))
expect_warning(expect_error({stop()}))
expect_warning(expect_error({warning()}))
expect_warning(expect_error({}))
})
)
#> ...12.34
#> Failed --------------------------------------------------------------------
#> 1. Failure: error + warning (@<text>#6) -----------------------------------
#> expect_error(...) showed 0 warnings
#>
#>
#> 2. Failure: error + warning (@<text>#7) -----------------------------------
#> {
#> ...
#> } did not throw an error.
#>
#>
#> 3. Failure: error + warning (@<text>#8) -----------------------------------
#> {
#> ...
#> } did not throw an error.
#>
#>
#> 4. Failure: error + warning (@<text>#8) -----------------------------------
#> expect_error(...) showed 0 warnings
#>
#>
#> DONE ======================================================================
combine_pair_test <- function(item_pair, var1, var2, result, | ||
can_combine = TRUE, warning = FALSE) { | ||
if (can_combine) { | ||
if (warning) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this can be simplified:
warning_regexp <- if (warning) ".*" else NA
expect_warning(..., warning_regexp)
if (can_combine) { | ||
if (warning) { | ||
expect_warning(res <- combine(item_pair), | ||
label = paste0("combine(items[c(\"", var1, "\", \"", var2, "\")])")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you please extract a variable here?
} | ||
|
||
|
||
prepare_table_with_coercion_rules <- function(items) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might be better to declare items
in the function, because the code below is tightly coupled with the names. However, you could split this large function in several smaller functions.
expect_equal(works3, expected_result) | ||
}) | ||
|
||
combine_pair_test <- function(item_pair, var1, var2, result, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The definition of the pair tests and the test logic should perhaps live in a helper file, tests/testthat/helper-combine.R
or similar.
} | ||
|
||
combine_coercion_types <- function() { | ||
items <- list(logicalvalue = TRUE, logicalNA = NA, integer = 4L, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you use logicalNA = NA, logicalNA = c(NA, NA)
here? I suspect we could then get rid of the explicit tests.
switch (TYPEOF(model)) { | ||
case INTSXP: | ||
if (Rf_inherits(model, "Date")) | ||
return new TypedCollecter<INTSXP>(n, get_date_classes()); | ||
if (Rf_inherits(model, "factor")) | ||
return new Collecter_Impl<STRSXP>(n); | ||
if (has_classes(model)) { | ||
SEXP classes = Rf_getAttrib(model, R_ClassSymbol); | ||
Rf_warning("Coercing class %s, with possible loss of information", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Coercing into what?
return new Collecter_Impl<INTSXP>(n); | ||
case REALSXP: | ||
if (Rf_inherits(model, "POSIXct")) | ||
return new POSIXctCollecter(n, Rf_getAttrib(model, Rf_install("tzone"))); | ||
if (Rf_inherits(model, "Date")) | ||
return new TypedCollecter<REALSXP>(n, get_date_classes()); | ||
if (has_classes(model)) { | ||
SEXP classes = Rf_getAttrib(model, R_ClassSymbol); | ||
Rf_warning("Coercing class %s, with possible loss of information", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
... into what?
Oh, don't worry about the merge conflict, I can resolve it. |
prepare_table_with_coercion_rules is decoupled from the items variable as suggested in comments on tidyverse#2209.
I have taken care of all your comments. The most noticeable change is that I decoupled I am ready for the next round of comments if there still are any other left 👍 |
prepare_table_with_coercion_rules is decoupled from the items variable as suggested in comments on tidyverse#2209.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. Almost ready ;-)
int_with_class = structure(4L, class = "int_with_class"), | ||
num_with_class = structure(4.5, class = "num_with_class")) | ||
|
||
pairs <- prepare_table_with_coercion_rules(items) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you define items in prepare_table_with_coercion_rules() and call that function without arguments? This probably means that we need to add a new list column "item_pair" to the pairs data frame.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
can_be_combined <- function(item1, item2, | ||
class1, class2, | ||
all_na1, all_na2, | ||
known_to_dplyr1, known_to_dplyr2) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not a big fan of local functions. Do they access variables other than their arguments?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As you prefer, I have placed those local functions as conventional functions. 👍
Like you, I don't like functions that use variables out of their scope (variables other than their arguments). However I find it practical to have local functions when there is only one function that calls them (a local function, like a local variable).
As this is just a matter of taste, I have avoided using local functions as you requested 😃
I will finish on Monday evening. A romantic weekend in Rome awaits me. Thanks for all the time and reviews! |
…n rules This commit fixes issue tidyverse#2203, allowing combine to deal with missing values. Additionally it restricts coercion rules, in particular coercing logical to integer or double is not allowed anymore. Other coercion cases will give warnings, if information may be lost in the conversion, for instance when coercing integers with classes, such as difftime.
This commit checks the coercion rules of many pairs of types: - logical values - logical missing value - character - integer - factor - double - complex - integer with class - double with class - Date - POSIXct
prepare_table_with_coercion_rules is decoupled from the items variable as suggested in comments on tidyverse#2209.
- Replace local functions - Define variable inside function
Thanks! |
This old issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with reprex) and link to this issue. https://reprex.tidyverse.org/ |
This rather small pull request:
NA
and converting them toNA_character_
.Any feedback will be very much appreciated