Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Distinct without arguments on a grouped df #2476

Closed
rubenarslan opened this issue Feb 27, 2017 · 5 comments
Closed

Distinct without arguments on a grouped df #2476

rubenarslan opened this issue Feb 27, 2017 · 5 comments
Labels

Comments

@rubenarslan
Copy link

@rubenarslan rubenarslan commented Feb 27, 2017

Before 0.5.0 distinct() on a df with groups had the same result as using unique(). Now, it shows unique groups.
This is not documented clearly in the help file or the release notes. I'd prefer if the behaviour were changed back, but alternatively, it should be mentioned in the help file and release notes.


This problem came up in the comments on #1981 again (which was closed, as it was about a different bug) and there it wasn't clear to @krlmlr that this was intended. Reading previous discussion at #2107 and #1110 it seems that this change was intended to make things more clear.
I think at least a few users agree that this is not more clear, as it feels like two different operations are activated by the same call and it makes pipeline order matter unduly.

Five examples

0. old behaviour?

iris %>% group_by(Species) %>% distinct(.keep_all = TRUE) 
#> Source: local data frame [3 x 5]
#> Groups: Species [3]
#> 
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
#>          <dbl>       <dbl>        <dbl>       <dbl>     <fctr>
#> 1          5.1         3.5          1.4         0.2     setosa
#> 2          7.0         3.2          4.7         1.4 versicolor
#> 3          6.3         3.3          6.0         2.5  virginica
# first row for each species is kept, 3 rows

1. returns unique values of species, 3 rows

iris %>% group_by(Species) %>% distinct()  
#> Source: local data frame [3 x 1]
#> Groups: Species [3]
#> 
#>      Species
#>       <fctr>
#> 1     setosa
#> 2 versicolor
#> 3  virginica

2. distinct rows, species as a group, 149 rows

iris %>% distinct() %>% group_by(Species)
#> Source: local data frame [149 x 5]
#> Groups: Species [3]
#> 
#>    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#>           <dbl>       <dbl>        <dbl>       <dbl>  <fctr>
#> 1           5.1         3.5          1.4         0.2  setosa
#> 2           4.9         3.0          1.4         0.2  setosa
#> 3           4.7         3.2          1.3         0.2  setosa
#> 4           4.6         3.1          1.5         0.2  setosa
#> 5           5.0         3.6          1.4         0.2  setosa
#> 6           5.4         3.9          1.7         0.4  setosa
#> 7           4.6         3.4          1.4         0.3  setosa
#> 8           5.0         3.4          1.5         0.2  setosa
#> 9           4.4         2.9          1.4         0.2  setosa
#> 10          4.9         3.1          1.5         0.1  setosa
#> # ... with 139 more rows

3. returns distinct values of Sepal.Length within species

iris %>% group_by(Species) %>% distinct(Sepal.Length)  
#> Source: local data frame [57 x 2]
#> Groups: Species [3]
#> 
#>    Sepal.Length Species
#>           <dbl>  <fctr>
#> 1           5.1  setosa
#> 2           4.9  setosa
#> 3           4.7  setosa
#> 4           4.6  setosa
#> 5           5.0  setosa
#> 6           5.4  setosa
#> 7           4.4  setosa
#> 8           4.8  setosa
#> 9           4.3  setosa
#> 10          5.8  setosa
#> # ... with 47 more rows

4. same as 3.

iris %>% distinct(Species, Sepal.Length) %>% group_by(Species) 
#> Source: local data frame [57 x 2]
#> Groups: Species [3]
#> 
#>    Species Sepal.Length
#>     <fctr>        <dbl>
#> 1   setosa          5.1
#> 2   setosa          4.9
#> 3   setosa          4.7
#> 4   setosa          4.6
#> 5   setosa          5.0
#> 6   setosa          5.4
#> 7   setosa          4.4
#> 8   setosa          4.8
#> 9   setosa          4.3
#> 10  setosa          5.8
#> # ... with 47 more rows

Desired behaviour

Desired behaviour for distinct() on a grouped tbl what I'd expect based on
close reading of docs (i.e. ... expands to all variables, including group).

iris %>% group_by(Species) %>% distinct(Sepal.Length, Sepal.Width, Petal.Length, 
  Petal.Width)
#> Source: local data frame [149 x 5]
#> Groups: Species [3]
#> 
#>    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#>           <dbl>       <dbl>        <dbl>       <dbl>  <fctr>
#> 1           5.1         3.5          1.4         0.2  setosa
#> 2           4.9         3.0          1.4         0.2  setosa
#> 3           4.7         3.2          1.3         0.2  setosa
#> 4           4.6         3.1          1.5         0.2  setosa
#> 5           5.0         3.6          1.4         0.2  setosa
#> 6           5.4         3.9          1.7         0.4  setosa
#> 7           4.6         3.4          1.4         0.3  setosa
#> 8           5.0         3.4          1.5         0.2  setosa
#> 9           4.4         2.9          1.4         0.2  setosa
#> 10          4.9         3.1          1.5         0.1  setosa
#> # ... with 139 more rows

Possible help files changes:

... : Optional variables to use when determining uniqueness. If there are multiple rows for a given combination of inputs, only the first row will be preserved. If omitted, will use all variables.
.keep_all: If TRUE, keep all variables in .data. If a combination of ... is not distinct, this keeps the first row of values.

Possible changes:
If omitted on a tbl without groups, will use all variables. If omitted on a tbl with groups, will list unique groups, but will list unique combinations within group if specified.

.keep_all: Defaults to FALSE if ... is omitted on a non-grouped tbl.

Motivating use case to change it so that ... expands to all variables if omitted

I think, this is not the old behaviour or the behaviour that .keep_all = TRUE elicits, I think it can currently only be obtained by explicitly listing all variable names or by calling unique.

  1. Consistency with non-grouped tbl case. This would mean that all variables are checked for uniqueness (in each group/including group as a var), because ... expands to all variable names.
  2. Consistency with call with ... defined. Example 3 and 4 are equivalent, so I'd naïvely expect 1 and 2 to also be equivalent.
@rubenarslan rubenarslan changed the title Distinct without arguments on a gru Distinct without arguments on a grouped df Feb 27, 2017
@hadley
Copy link
Member

@hadley hadley commented Feb 27, 2017

Would you mind explaining exactly how you think it should behave, and why? I think you're arguing that distinct() with no arguments should basically ignoring grouping?

Also please use the reprex package to construct reprexes that include the output. That makes it easier to read this issue without having to switch back and forth to R.

@rubenarslan
Copy link
Author

@rubenarslan rubenarslan commented Feb 27, 2017

Sorry, I didn't get that reprex was a reference to a package, not just a shorthand. Added it now.

Exact behaviour:
I suggest distinct() without dots should behave as I understood the documentation ("If omitted, will use all variables."), i.e. use all variables, including the group variable.
This would not be ignoring grouping, because getting all distinct rows in each group is the same as getting all distinct rows including the group var.

My last example was supposed to show what I'd prefer. It's the same behaviour you get when calling unique on a tbl.

@hadley
Copy link
Member

@hadley hadley commented Feb 27, 2017

Ok, thanks. I understand the problem, and I'll make the change in the next week or two (this need to wait until the tidyeval conversion is complete)

@hadley
Copy link
Member

@hadley hadley commented Apr 4, 2017

Just to be clear, you want these two calls to be equivalent, right?

iris %>% group_by(Species) %>% distinct()
iris %>% distinct() %>% group_by(Species)

So effectively, grouping has no impact on distinct(), except that the output will be grouped if the input is.

@rubenarslan
Copy link
Author

@rubenarslan rubenarslan commented Apr 4, 2017

@hadley hadley closed this in 1d956ef Apr 4, 2017
@lock lock bot locked as resolved and limited conversation to collaborators Jun 8, 2018
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
2 participants