Unique operator #97

Closed
hadley opened this issue Oct 25, 2013 · 10 comments
Labels
feature a feature request or enhancement

@hadley
Member

hadley commented Oct 25, 2013

Or similar - would translate to DISTINCT in SQL.

@hadley hadley modified the milestones: 0.3, v0.2 Mar 17, 2014
@romainfrancois
Member

Perhaps distinct, so that we don't have more problems with the namespace police. It should not be too much trouble to use the visitor classes we already have to get unique rows more efficiently than R's unique does.

@romainfrancois
Member

Alright, R's unique pastes all columns together and then uses duplicated on that silly character vector.

> unique.data.frame
function (x, incomparables = FALSE, fromLast = FALSE, ...)
{
    if (!identical(incomparables, FALSE))
        .NotYetUsed("incomparables != FALSE")
    x[!duplicated(x, fromLast = fromLast, ...), , drop = FALSE]
}
<bytecode: 0x105a2a188>
<environment: namespace:base>
> duplicated.data.frame
function (x, incomparables = FALSE, fromLast = FALSE, ...)
{
    if (!identical(incomparables, FALSE))
        .NotYetUsed("incomparables != FALSE")
    if (length(x) != 1L)
        duplicated(do.call("paste", c(x, sep = "\r")), fromLast = fromLast)
    else duplicated(x[[1L]], fromLast = fromLast, ...)
}
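
To make the paste-based approach concrete, here is a minimal sketch that repeats by hand what the duplicated.data.frame source quoted above does for multi-column input. The toy data frame and the variable names are assumptions for illustration, and other R versions may implement this method differently.

```r
# Toy data frame (illustrative): rows 1 and 3 are identical.
df <- data.frame(a = c(1, 2, 1), b = c("x", "y", "x"),
                 stringsAsFactors = FALSE)

# Build one string key per row by pasting all columns together,
# separated by "\r", as in the source quoted above.
keys <- do.call("paste", c(df, sep = "\r"))

# Flag rows whose key has already been seen, then drop them.
dup <- duplicated(keys)
unique_rows <- df[!dup, , drop = FALSE]
```

Every row is converted to a character string before comparison, which is the overhead the comment above objects to.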

@hadley
Member Author

hadley commented Apr 2, 2014

I like distinct! Another option would be to use union which I've already made generic. We also need intersect and setdiff for data frames to cover the remaining options.

romainfrancois added a commit that referenced this issue Apr 2, 2014
@romainfrancois
Member

I pushed some initial code.

> distinct <- distinct_impl
> mtcars %.% select(cyl) %.% distinct()
  cyl
1   6
2   4
3   8

Should the data be ordered at the end? At the moment, it is in the order of appearance in the original data.
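
For comparison, a small base R sketch of the two orderings in question. It uses unique and order rather than distinct_impl, so it only illustrates the choice and is not the dplyr implementation.

```r
# Order of first appearance, matching the output shown above.
cyl <- unique(mtcars["cyl"])

# The same rows sorted by value instead.
sorted <- cyl[order(cyl$cyl), , drop = FALSE]
```

Keeping the order of appearance leaves sorting to the caller, who can apply it afterwards only when it is needed.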

@romainfrancois
Member

Turns out, we already have union, intersect and setdiff, match:
https://github.com/hadley/dplyr/blob/master/src/dplyr.cpp#L806

@hadley hadley self-assigned this Jul 28, 2014
@hadley
Member Author

hadley commented Jul 28, 2014

Looks good. I'll add the generic and methods for data frames, data tables, and SQL.

@hadley
Member Author

hadley commented Jul 28, 2014

@romainfrancois could you add a second argument to distinct_impl, a vector of column names/indices to use? The convention is to take the first row when multiple rows match.

@hadley
Member Author

hadley commented Jul 28, 2014

I've implemented the generic and basic methods. Let me know when you've added this feature and I'll add the docs & some tests.

@romainfrancois
Member

Done, I get this now:

> df <- data.frame(x = rep(1:4, each = 4), y = rep(1:8, each = 2), z = 1:16)
> distinct_impl(df, c("x", "y"))
  x y
1 1 1
2 1 2
3 2 3
4 2 4
5 3 5
6 3 6
7 4 7
8 4 8
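
The "first row wins" convention requested earlier in the thread can be sketched in base R as follows. The helper distinct_first is hypothetical and is not dplyr's distinct_impl. Unlike the output above, it keeps all columns, so that it is visible which row was retained for each key.

```r
df <- data.frame(x = rep(1:4, each = 4), y = rep(1:8, each = 2), z = 1:16)

# Hypothetical helper (an assumption, not part of dplyr): keep the first
# row for each distinct combination of the columns named in `cols`.
distinct_first <- function(data, cols) {
  keep <- !duplicated(data[cols])
  data[keep, , drop = FALSE]
}

res <- distinct_first(df, c("x", "y"))
```

Because duplicated only flags the second and later occurrences of a key, the surviving z value in each group is the one from the earliest matching row.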

@hadley hadley assigned hadley and unassigned romainfrancois Sep 11, 2014
@hadley hadley closed this as completed in 9f17598 Sep 25, 2014
@lock lock bot locked as resolved and limited conversation to collaborators Jun 10, 2018