Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

distinct function and lists in column bug #1670

Closed
zhilongjia opened this issue Feb 20, 2016 · 14 comments
Closed

distinct function and lists in column bug #1670

zhilongjia opened this issue Feb 20, 2016 · 14 comments
Assignees
Labels
bug an unexpected problem or unintended behavior
Milestone

Comments

@zhilongjia
Copy link

For a column containing lists in data.frame, distinct will fail.
Example:

df <- data.frame(
    x = sample(10, 100, rep = TRUE),
    y = sample(10, 100, rep = TRUE)
)
df$z <- list(c("a", "b", "c"))
df$z[3] <- list(c("a", "b", "c"))
dplyr::distinct(df, z)
# result:
# x y       z
#8 3 a, b, c
#5 5 a, b, c

# shoud be
#8 3 a, b, c
identical(df$z[1], df$z[3])
# TRUE
@hadley
Copy link
Member

hadley commented Mar 1, 2016

Simpler reprex:

df <- dplyr::data_frame(
  x = list(list(1, 2), list(1, 2))
)
dplyr::distinct(df, x)

@romainfrancois can you please take a look?

@hadley hadley added the bug an unexpected problem or unintended behavior label Mar 1, 2016
@hadley hadley added this to the 0.5 milestone Mar 1, 2016
@hadley hadley modified the milestones: future, 0.5 May 26, 2016
@krlmlr
Copy link
Member

krlmlr commented Sep 25, 2016

This is because currently the binary representation is hashed, which is just a pointer to the value, and different in this scenario. The following returns just one row:

a <- list(1, 2)

df <- dplyr::data_frame(
  x = list(a, a)
)
dplyr::distinct(df, x)

We should probably just give an error here.

@soniampub
Copy link

@krlmlr I would like to work on this issue. I am a beginner in open source contribution and also in dplyr. Please let me know what should be fixed here. Thanks!!

@krlmlr
Copy link
Member

krlmlr commented Oct 6, 2016

@soniampub: I'd start writing a test that defines the expected behavior, it should fail with the current codebase.

@soniampub
Copy link

soniampub commented Oct 10, 2016

@krlmlr Sure let me know when you had checked in that test. Also in my local git is there a script to run all tests and see which one is failing, is there a documentation or mailing list that I should use to post beginner questions. I appreciate your help.

@krlmlr
Copy link
Member

krlmlr commented Oct 10, 2016

The tests can be run with devtools::test(). StackOverflow is a good place to ask questions of any kind.

From my gut feeling, this issue requires a change to the C++ code. A pull request that contains a failing test and a fix, or even just a failing test, will be greatly appreciated.

@MarcAragones
Copy link

It happens the same with vectors:

df <- dplyr::data_frame(
  x = list(c(1, 2), c(1, 2))
)
dplyr::distinct(df, x)

# Result:
# A tibble: 2 × 1
          x
     <list>
1 <dbl [2]>
2 <dbl [2]>

@krlmlr
Copy link
Member

krlmlr commented Nov 8, 2016

I think we can't do much about it except documenting it, see also #2222.

@hadley
Copy link
Member

hadley commented Nov 8, 2016

This seems like a bug to me.

@krlmlr
Copy link
Member

krlmlr commented Nov 8, 2016

Yes, but we'll have a very hard time resolving it. When should two arbitrary values be equal? How do you define/compute a hash value that is equal if the values are equal?

Reference equality works for many use cases (e.g., if one data frame is derived from another).

@hadley
Copy link
Member

hadley commented Nov 8, 2016

In that case, I'd rather throw a clean error. We should not be exposing R users to the low-level details of reference equality.

@krlmlr
Copy link
Member

krlmlr commented Nov 8, 2016

This is what all_equal() does, and users complain too.

How about providing an easy way to substitute columns we can't compare with their hashed equivalent (SEXP, sha1, hashr), and hint the user that they should use that before comparing?

@krlmlr
Copy link
Member

krlmlr commented Feb 10, 2017

More recent ticket: #2355.

@hadley
Copy link
Member

hadley commented Feb 10, 2017

@krlmlr feel free to close these duplicates

@hadley hadley closed this as completed Feb 10, 2017
@lock lock bot locked as resolved and limited conversation to collaborators Jun 8, 2018
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
bug an unexpected problem or unintended behavior
Projects
None yet
Development

No branches or pull requests

6 participants