distinct_.grouped_df selects only one instance of each combination of grouping variables #1110

Closed
bergsmat opened this issue Apr 27, 2015 · 4 comments

@bergsmat bergsmat commented Apr 27, 2015

Consider:

> library(dplyr)
> library(magrittr)
> x <- data.frame(id=c(1,1),val=c('a','b'))
> x %>% group_by(id) %>% distinct
Source: local data frame [1 x 2]
Groups: id

  id val
1  1   a
> x %>% group_by(id) %>% unique
Source: local data frame [2 x 2]
Groups: id

  id val
1  1   a
2  1   b

The help for distinct just says it is an efficient version of unique, but the two clearly differ for grouped_df: distinct drops rows that repeat a key, not just rows that repeat a whole record. This seems surprising and possibly dangerous. If intended, it is worth a mention in the help or in the vignette, e.g. http://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html .

Using dplyr ref='c0b28a93dfbebee0a7e5f8aa17d2474ab773e132' on Windows 8.

@romainfrancois romainfrancois self-assigned this Apr 30, 2015
Member

@romainfrancois romainfrancois commented Apr 30, 2015

Hmm. That's because, at least the way it's implemented, passing no arguments to distinct on a grouped data frame means "get distinct values of the grouping variables". That is different from getting distinct values using all columns, which is what the ungrouped version does.

I guess we'd need to either change the logic in this function:

distinct_.grouped_df <- function(.data, ..., .dots) {
  groups <- lazyeval::as.lazy_dots(groups(.data))
  dist <- distinct_vars(.data, ..., .dots = c(.dots, groups))

  grouped_df(distinct_impl(dist$data, dist$vars), groups(.data))
}

or internally here:

    if( !vars.size() ){
        vars = df.names() ;
    }
    DataFrameVisitors visitors(df, vars) ;
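
To make the interaction between the two snippets concrete, here is a minimal base R sketch of how the variable list appears to be resolved. The helper `resolve_distinct_vars` is hypothetical (it is not part of dplyr) and only mirrors the logic quoted above: the grouped method appends the grouping variables, so the "fall back to all columns" branch is never reached for a grouped data frame.

```r
# Hypothetical helper, not a dplyr function: it imitates the quoted logic only.
resolve_distinct_vars <- function(all_names, group_vars, requested = character()) {
  # The grouped method adds the grouping variables to whatever was requested.
  vars <- unique(c(requested, group_vars))
  # The fallback to "all columns" only fires when the list is still empty.
  if (length(vars) == 0) vars <- all_names
  vars
}

# Ungrouped, no arguments: every column is used.
resolve_distinct_vars(c("id", "val"), group_vars = character())  # "id" "val"

# Grouped by id, no arguments: only the grouping variable is used.
resolve_distinct_vars(c("id", "val"), group_vars = "id")         # "id"
```

Under this reading, either place would work for a fix: skip appending the groups when nothing was requested, or treat "only grouping variables" as empty in the internal check.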

@romainfrancois romainfrancois removed their assignment Apr 30, 2015
Author

@bergsmat bergsmat commented Apr 30, 2015

Actually, the interpretation "get distinct values of the grouping variables" would be fine if only the grouping variables were returned. I prefer the broader understanding (analogous to unique). With either, there are quick ways to generate the other. Thanks for considering.
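
One possible reading of "quick ways to generate the other", sketched in base R so it does not depend on any particular dplyr version. The three-row data frame and the variable names are illustrative assumptions, not taken from the report.

```r
# Illustrative data: two records share id 1, one record has id 2.
x <- data.frame(id = c(1, 1, 2), val = c("a", "b", "b"))

# Broad reading: distinct records across all columns, like unique().
records <- unique(x)

# From the broad result, reduce to the distinct keys only.
keys <- unique(records["id"])

# From the distinct keys, look up one full record per key in the data
# (match() returns the position of the first row carrying each id).
first_per_key <- x[match(keys$id, x$id), , drop = FALSE]

nrow(records)        # 3
nrow(keys)           # 2
first_per_key$val    # first value seen for id 1, then for id 2
```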

@hadley hadley added this to the 0.5 milestone May 19, 2015

@alexfun alexfun commented Jul 7, 2015

Please document this behaviour of distinct on grouped data frames in the help file. Thanks.

http://stackoverflow.com/questions/31259518/why-does-dplyrdistinct-behave-like-this-for-grouped-data-frames#31259587

Member

@hadley hadley commented Oct 22, 2015

Maybe distinct should only keep variables that are explicitly mentioned? I think keeping all the variables is confusing. We could have an option to revert to the old behaviour. Related to #1150
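
A rough base R sketch of what this proposal could look like. The function `distinct_sketch` and its `keep_all` argument are hypothetical names chosen for illustration; this is not dplyr's implementation. Rows are deduplicated on the mentioned variables, and only those variables are returned unless the caller opts back into keeping every column.

```r
# Hypothetical illustration of the proposal, not dplyr code.
distinct_sketch <- function(df, vars = names(df), keep_all = FALSE) {
  # Keep the first row for each combination of the mentioned variables.
  first_seen <- !duplicated(df[vars])
  # Return only the mentioned variables unless keep_all asks for every column.
  cols <- if (keep_all) names(df) else vars
  df[first_seen, cols, drop = FALSE]
}

x <- data.frame(id = c(1, 1), val = c("a", "b"))

names(distinct_sketch(x, "id"))                   # "id"
names(distinct_sketch(x, "id", keep_all = TRUE))  # "id" "val"
nrow(distinct_sketch(x))                          # 2: no variables named, so all columns are used
```

Released dplyr versions expose a comparable switch as the `.keep_all` argument of `distinct()`; the documentation of the installed version is the place to confirm its exact semantics.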

dpastoor referenced this issue in ronkeizer/PKPDsim Oct 26, 2015
@hadley hadley closed this in b3179f9 Mar 14, 2016
zkamvar added a commit to zkamvar/dplyr that referenced this issue Mar 16, 2016
To make tidyverse#1110 compatible if no vars are selected.
zkamvar added a commit to zkamvar/dplyr that referenced this issue Mar 24, 2016
To make tidyverse#1110 compatible if no vars are selected.
zkamvar added a commit to zkamvar/dplyr that referenced this issue Mar 24, 2016
@lock lock bot locked as resolved and limited conversation to collaborators Jun 9, 2018
4 participants