Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Integer not automatically coerced to double #1892

Closed
russellpierce opened this issue Jun 7, 2016 · 14 comments · Fixed by #2487
Closed

Integer not automatically coerced to double #1892

russellpierce opened this issue Jun 7, 2016 · 14 comments · Fixed by #2487
Labels
bug an unexpected problem or unintended behavior

Comments

@russellpierce
Copy link
Contributor

This might be a won't fix, but I thought I'd float the balloon anyway.

Consider:

library(dplyr)
data_frame(
  id = c(1,2,3,4,5,6),
  value = c(1L,2L,3L,NA,NA,NA),
  group = c("A","A","A","B","B","B")
) %>% 
  group_by(group) %>%
  mutate(value = ifelse(is.na(value),0,value))

Which fails.

Versus:

data_frame(
  id = c(1,2,3,4,5,6),
  value = c(1L,2L,3L,4,NA,NA),
  group = c("A","A","A","B","B","B")
) %>% 
  mutate(value = ifelse(is.na(value),0,value)) %>%
  group_by(group)

Which does not fail.

As a matter of linguistic parsing one might not expect different outcomes based on ordering. Is there a reason dplyr can't coerce the returns from the group_by operation to share in the most general common data type (as done elsewhere in R) so that both functional orders yield successful results?

@hadley hadley closed this as completed Jun 7, 2016
@hadley hadley reopened this Jun 7, 2016
@krlmlr
Copy link
Member

krlmlr commented Nov 7, 2016

Thanks, confirmed. Works for me if I replace 0 by 0L in the failing example. Would you like to contribute a testthat test?

@krlmlr krlmlr added bug an unexpected problem or unintended behavior data frame labels Nov 7, 2016
@russellpierce
Copy link
Contributor Author

@krlmlr The test would be added to test-group-by-r.r, correct?

russellpierce pushed a commit to russellpierce/dplyr that referenced this issue Nov 11, 2016
@krlmlr
Copy link
Member

krlmlr commented Nov 11, 2016

Is there a reason dplyr can't coerce the returns from the group_by operation to share in the most general common data type (as done elsewhere in R) so that both functional orders yield successful results?

Yes, if a grouped mutate uses different return types it's safer to throw an error; if this is really desired behavior the user can coerce. The only exception we'd like to allow is integer vs. double, and logical NAs, see also r-lib/vctrs#7 and dplyr::if_else().

Test location looks good to me.

@russellpierce
Copy link
Contributor Author

Also note that grouped mutate does not current coerce factor + character -> character (WARN) as expected. See PR #2249 for a demonstrative test.

russellpierce pushed a commit to russellpierce/dplyr that referenced this issue Nov 12, 2016
krlmlr pushed a commit that referenced this issue Dec 6, 2016
* Add failing test for #1892

* Correct Description of Test

* Correct Spelling

* Add Skip for Currently Failing Test

* Correct Formatting

* Correct Formatting and Clarify Data Type

* Increased Clarity of Test Expectation

* Add Failing Test

* Prefer typeof() in place of class()

* Single Instance per Group; Clarified Description; expect_type() rather than expect_equal()

* Code Area Covered by expect_warning() Reduced

* Explicitly Expect Warning to be Issued by Mutate

* Add Reference to Issue #1892

* Test are in Mutate Rather than group_by with Other Grouped Mutate Tests

* Add Expectations Regarding the Exact Output

* Drop explicit (and unnecessary) casting
@hadley
Copy link
Member

hadley commented Feb 21, 2017

Here's a simpler reprex, along with a suggestion that the code path is different for summarise():

library(dplyr, warn.conflicts = FALSE)

df <- tibble(
  value = c(1L, NA),
  group = c("A", "B")
) 

df %>% 
  group_by(group) %>%
  mutate(value = if (is.na(value)) 0 else 1L)
#> Error in mutate_impl(.data, dots): incompatible types, expecting a integer vector

df %>% 
  group_by(group) %>%
  summarise(value = if (is.na(value)) 0 else 1L)
#> # A tibble: 2 × 2
#>   group value
#>   <chr> <dbl>
#> 1     A     1
#> 2     B     0

@hadley
Copy link
Member

hadley commented Feb 27, 2017

Another reprex:

df <- tibble(
  g = c(1, 2, 2), 
  x = c(1L, 2L, 3L)
)

df %>%
  group_by(g) %>%
  mutate(median(x))

@hadley hadley changed the title Semantically inconsistent (?) response to varying return-types Integer not automatically coerced to double Feb 27, 2017
@hadley
Copy link
Member

hadley commented Feb 27, 2017

@zeehio any ideas what's going on here? Despite my previous assertion, I'd really like to fix this bug for the next release.

@zeehio
Copy link
Contributor

zeehio commented Feb 27, 2017

This issue goes through a different code path than the Collecter.h case I was working at.

The error is raised here https://github.com/hadley/dplyr/blob/61f77bc1d2dbf496cf5b72a750a24f134d506a45/inst/include/dplyr/Gatherer.h#L81

The wrong underlying assumption in Gatherer.h is that the type returned by (in this reprex case) median(x) will be always the same for all the groups. This bug has never been fully addressed in the past #489 (comment) as the solution proposed there was to fix whatever was inside mutate to always return the same type, and I feel that is a just workaround.

I think the best solution would be to use the Collecter.h code path inside gatherer. I can try to commit a pull request during the following two weeks if that is fine for you (I am quite busy this week).

@hadley
Copy link
Member

hadley commented Feb 27, 2017

Yes, that would be fantastic! It would fix a whole class of bugs.

@zeehio
Copy link
Contributor

zeehio commented Feb 28, 2017

Current solution (dplyr-0.5.0 and master):

Implementation

  1. Take the first group in the data frame for which the result of the mutate operation returns non-NA.
  2. Create a vector of the type of the result filled with NA with length equal to the total number of rows. (If it is a list, create a list of that length).
  3. Store the computed result of the first group in the vector.
  4. For each of the pending data frame groups, compute their result and store it in the vector (or list).
  5. The vector (or list) is the final result of the mutate operation.

Assumptions of the current algorithm:

  • The result of applying the mutate operation to any group can be NA.
  • If the result is not NA, then it can be a list or a vector.
  • All the non-NA results must be of the exact same type.

General case assumptions:

  • Each result of applying the mutate operation to a group can be of a different type.
  • If all the result are combinable and of consistent length, the final output may be a vector column, otherwise it needs to be a list column.

Possible general case implementation

  1. We compute all the groups storing the elements in a List.
  2. We combine_all() the list if possible. If it is possible, return a vector column, otherwise return a list column.

Issue with this general case

  • We need to compute all the results in advance to know whether or not they are combinable. Therefore we would need to store all the results in a List and use combine_all to combine them. This would require storing an extra copy of the results and may be not-desirable.

Tradeoff solution proposal

Assumptions

  • Each result of applying the mutate operation to a group can be of a different type. The only restriction is that either all the result are of type list (and then we return a list-column) or all the result are not a list (and then we combine them into a vector column or raise an error).

Advantage with respect to the general case:

  • We decide after computing the result for the first group whether we follow the vector path or the list path, avoiding the extra copy.

Advantage with respect to the current implementation:

  • Coercions and promotions will be done through the Collecter implementation.

@hadley if the tradeoff solution is good for you I will go with it.

@hadley
Copy link
Member

hadley commented Feb 28, 2017

@zeehio in the tradeoff solution, could you spell out in a bit more detail what happens if the types are different in different groups?

@zeehio
Copy link
Contributor

zeehio commented Feb 28, 2017

Yes, we will use the same rules as in combine():

  • Classes unknown to dplyr are stripped (known classes: POSIXct, factor, Date, AsIs, integer64, table) (as in Combine strips unknown classes. Support integer64. #2425)
  • NA values are accepted
  • integers are promoted to numerics if needed
  • factors and characters are coerced to characters
  • factors with different levels are coerced to characters

@hadley
Copy link
Member

hadley commented Feb 28, 2017

Ok, I don't quite see how this will work under the hood, but it sounds reasonable, and I'd love to review a PR 😁

zeehio added a commit to zeehio/dplyr that referenced this issue Mar 2, 2017
`mutate(col2 = fun(col1))` on a grouped data frame calls `fun` once per group.

It used to require that `fun` returns the exact same type and that was not desirable in functions that may return different (but compatible) types, such as integer and numeric.

This PR changes that behaviour, so the returned vectors from each of the `fun` calls are combined using the same coercion rules than `combine` and `bind_rows`, defined in `Collecter.h`.
zeehio added a commit to zeehio/dplyr that referenced this issue Mar 2, 2017
`mutate(col2 = fun(col1))` on a grouped data frame calls `fun` once per group.

It used to require that `fun` returns the exact same type and that was not desirable in functions that may return different (but compatible) types, such as integer and numeric.

This PR changes that behaviour, so the returned vectors from each of the `fun` calls are combined using the same coercion rules than `combine` and `bind_rows`, defined in `Collecter.h`.
@zeehio
Copy link
Contributor

zeehio commented Mar 2, 2017

@hadley I give you two PR instead of one. The second one includes the commits of the first one, because it depends on it, but after the first PR is merged the second PR will consist only of a single commit.

It was easier than expected to make it work, although my C++ and Rcpp still need some practice. If you find things done in a weird way, assume my Rcpp skills are not good and ask.

zeehio added a commit to zeehio/dplyr that referenced this issue Mar 2, 2017
`mutate(col2 = fun(col1))` on a grouped data frame calls `fun` once per group.

It used to require that `fun` returns the exact same type and that was not desirable in functions that may return different (but compatible) types, such as integer and numeric.

This PR changes that behaviour, so the returned vectors from each of the `fun` calls are combined using the same coercion rules than `combine` and `bind_rows`, defined in `Collecter.h`.
hadley pushed a commit that referenced this issue Mar 5, 2017
* Add difftime support to Collecter.h

* Fix oldrel test

* Make mutate use collecter.h. Closes #1892

`mutate(col2 = fun(col1))` on a grouped data frame calls `fun` once per group.

It used to require that `fun` returns the exact same type and that was not desirable in functions that may return different (but compatible) types, such as integer and numeric.

This PR changes that behaviour, so the returned vectors from each of the `fun` calls are combined using the same coercion rules than `combine` and `bind_rows`, defined in `Collecter.h`.

* Add hms support and other @krlmlr feedback

* Further suggestions to PR  #2487 by @krlmlr

* Difftime collecter corrections suggested by @hadley
@lock lock bot locked as resolved and limited conversation to collaborators Jun 8, 2018
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
bug an unexpected problem or unintended behavior
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants