distinct function and lists in column bug #1670
Comments
Simpler reprex:
    df <- dplyr::data_frame(
      x = list(list(1, 2), list(1, 2))
    )
    dplyr::distinct(df, x)
@romainfrancois can you please take a look?
This happens because the binary representation of each element is currently hashed. For a list element that representation is just a pointer to the value, and the two pointers differ in this scenario. The following returns just one row:
    a <- list(1, 2)
    df <- dplyr::data_frame(
      x = list(a, a)
    )
    dplyr::distinct(df, x)
We should probably just give an error here.
@krlmlr I would like to work on this issue. I am a beginner in open source contribution and also in dplyr. Please let me know what should be fixed here. Thanks!
@soniampub: I'd start by writing a test that defines the expected behavior; it should fail with the current codebase.
@krlmlr Sure, let me know when you have checked in that test. Also, is there a script I can run in my local checkout to run all the tests and see which ones fail? And is there documentation or a mailing list I should use for beginner questions? I appreciate your help.
The tests can be run with From my gut feeling, this issue requires a change to the C++ code. A pull request that contains a failing test and a fix, or even just a failing test, will be greatly appreciated.
The same happens with vectors:
    df <- dplyr::data_frame(
      x = list(c(1, 2), c(1, 2))
    )
    dplyr::distinct(df, x)
    # Result:
    # A tibble: 2 × 1
    #   x
    #   <list>
    # 1 <dbl [2]>
    # 2 <dbl [2]>
I think we can't do much about it except document it; see also #2222.
This seems like a bug to me.
Yes, but we'll have a very hard time resolving it. When should two arbitrary values be equal? How do you define/compute a hash value that is equal if the values are equal? Reference equality works for many use cases (e.g., if one data frame is derived from another).
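To make the value-versus-reference question concrete, here is a minimal base R sketch. It assumes `identical()` is an acceptable definition of equality for list elements, which is only one possible answer to the question above. The variable names are illustrative, and nothing here reflects how dplyr is implemented.

```r
# Two separately constructed lists with the same contents.
a <- list(1, 2)
b <- list(1, 2)

# Value equality: identical() compares the contents, not the memory address.
same_value <- identical(a, b)

# Base R's duplicated() also compares list elements by value.
x <- list(a, b, list(3))
dup <- duplicated(x)

# Keep only the first occurrence of each distinct value.
distinct_x <- x[!dup]
```

Base R's documentation for `unique()` warns that this can be slow for lists (quadratic in the worst case), which is part of why a hash-based approach is not straightforward.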
In that case, I'd rather throw a clean error. We should not be exposing R users to the low-level details of reference equality.
This is what all_equal() does, and users complain too. How about providing an easy way to substitute columns we can't compare with their hashed equivalent (SEXP, sha1, hashr), and hint the user that they should use that before comparing?
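A rough sketch of the "substitute a comparable key" idea, in base R only. The helper `value_key()` is hypothetical and not part of dplyr. It assumes `deparse()` output is a good enough stand-in for a real hash such as sha1, which may not hold for every object (for example, doubles that differ only beyond the printed precision could collide).

```r
# Hypothetical helper: turn each list element into a character key
# derived from its value rather than its memory address.
value_key <- function(col) {
  vapply(col, function(v) paste(deparse(v), collapse = "\n"), character(1))
}

# A data frame with a list column holding two equal vectors.
df <- data.frame(id = 1:3)
df$x <- list(c(1, 2), c(1, 2), c(3, 4))

# Substitute the list column with its key, then deduplicate on the key.
df$x_key <- value_key(df$x)
distinct_df <- df[!duplicated(df$x_key), c("id", "x"), drop = FALSE]
```

Deduplicating on the key column keeps the first row for each distinct value and leaves the original list column intact in the result.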
More recent ticket: #2355.
@krlmlr feel free to close these duplicates
distinct() fails for a data.frame column that contains lists.
Example: