Is slice() meant to give the same results as filter with row_number() on grouped data frames? #2192

Closed
alexfun opened this issue Oct 22, 2016 · 7 comments
Labels
bug an unexpected problem or unintended behavior

Comments

@alexfun

alexfun commented Oct 22, 2016

From:

http://stackoverflow.com/questions/40187530/in-dplyr-0-5-0-why-does-slice1-not-give-the-same-results-as-filterrow-number?noredirect=1#comment67643041_40187530

Using example data

library(dplyr)
tmp_df2 <- data.frame(a = c(1, 3, 2, 4), b = c(1, 2, 3, 4))

The operation


tmp_df2 %>%
    group_by(a) %>%
    slice(1)

produces a different row order to

tmp_df2 %>%
    group_by(a) %>%
    filter(row_number() == 1)
@krlmlr
Member

krlmlr commented Nov 7, 2016

Confirmed. @hadley: Do we guarantee any particular row order after group_by()?

@hadley
Member

hadley commented Nov 7, 2016

I think we should preserve the existing order of the rows (i.e. it's a "stable" split).

@krlmlr
Member

krlmlr commented Nov 7, 2016

@alexfun: Would you like to contribute a testthat test?

@alexfun alexfun added the bug an unexpected problem or unintended behavior label Nov 7, 2016
@alexfun
Author

alexfun commented Nov 7, 2016

@krlmlr I would be happy to. Let me know if you have any specific requests.

@krlmlr
Member

krlmlr commented Nov 7, 2016

@hadley @alexfun: I was confused.

The following two give identical results (ordered by the grouping variable):

tmp_df2 <- data.frame(a = c(1, 3, 2, 4), b = c(1, 2, 3, 4))

tmp_df2 %>%
  group_by(a) %>%
  slice(1) %>%
  ungroup()

tmp_df2 %>%
  group_by(a) %>%
  summarize(b = b[[1]]) %>%
  ungroup()

The filter example returns the data in order of appearance:

tmp_df2 %>%
    group_by(a) %>%
    filter(row_number() == 1)

The question is: is slice() a summarize-like or a filter-like operation? I tend to think it's more like a filter operation, in which case the observed behavior is a bug.
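To make the comparison concrete, here is a minimal sketch that checks whether the two pipelines select the same rows once row order is taken out of the picture. It assumes dplyr is installed; the names `by_slice` and `by_filter` are illustrative and do not come from the thread.

```r
library(dplyr, warn.conflicts = FALSE)

tmp_df2 <- data.frame(a = c(1, 3, 2, 4), b = c(1, 2, 3, 4))

# First row of each group via slice(), then sorted explicitly by the key
by_slice <- tmp_df2 %>%
  group_by(a) %>%
  slice(1) %>%
  ungroup() %>%
  arrange(a)

# First row of each group via filter(), sorted the same way
by_filter <- tmp_df2 %>%
  group_by(a) %>%
  filter(row_number() == 1) %>%
  ungroup() %>%
  arrange(a)

# After an explicit arrange(), any difference in row order is gone,
# so this compares only which rows were selected
same_rows <- isTRUE(all.equal(as.data.frame(by_slice), as.data.frame(by_filter)))
print(same_rows)
```

If `same_rows` is `TRUE`, the disagreement is only about ordering, not about which rows each verb keeps.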

@alexfun
Author

alexfun commented Nov 7, 2016

I would argue that summarise should also return the grouping variables in the order in which they originally appear. I had no idea that summarise exhibits this behaviour as well. This in particular is not good for my workflow, as I like to cbind summary data with data frames whose rows are sorted in a particular order. To get around the fact that summarise rearranges rows, I will need to have key columns in both the summarised data frames and the other data frames, and do a join instead of a straight cbind.
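The join-based workaround described above might look roughly like the following sketch. `other_df`, its `label` column, and `b_first` are hypothetical names made up for illustration; the sketch relies on `left_join()` keeping the rows of its first argument in their existing order.

```r
library(dplyr, warn.conflicts = FALSE)

tmp_df2 <- data.frame(a = c(1, 3, 2, 4), b = c(1, 2, 3, 4))

# Hypothetical data frame whose row order we want to keep
other_df <- data.frame(a = c(1, 3, 2, 4), label = c("w", "x", "y", "z"))

# Summary per group; the row order of this result is not relied upon
summary_df <- tmp_df2 %>%
  group_by(a) %>%
  summarise(b_first = first(b))

# Match on the key column instead of on row position
combined <- left_join(other_df, summary_df, by = "a")
print(combined)
```

Because rows are matched on `a` rather than on position, the result does not depend on how `summarise()` happens to order its output.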

@hadley
Member

hadley commented Feb 22, 2017

Minimal reprex:

library(dplyr, warn.conflicts = FALSE)
df <- tibble(a = c(2, 1), b = c("x", "y")) %>% group_by(a)

df %>% slice(1)
#> Source: local data frame [2 x 2]
#> Groups: a [2]
#> 
#>       a     b
#>   <dbl> <chr>
#> 1     1     y
#> 2     2     x
df %>% filter(row_number() == 1)
#> Source: local data frame [2 x 2]
#> Groups: a [2]
#> 
#>       a     b
#>   <dbl> <chr>
#> 1     2     x
#> 2     1     y

Thinking on it more, I don't think there's any guarantee that the row order should be the same. Relying on row order (instead of doing a join) is extremely dangerous because it will silently fail. So I don't think this is likely to get high enough up on my priority list to be fixed.
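For readers who do need the original row order after a grouped `slice()`, one option is to record it explicitly rather than rely on what the verb returns. This is a sketch, not something proposed in the thread; `orig_row` is a hypothetical helper column.

```r
library(dplyr, warn.conflicts = FALSE)

df <- tibble(a = c(2, 1), b = c("x", "y"))

res <- df %>%
  mutate(orig_row = row_number()) %>%  # remember the original position
  group_by(a) %>%
  slice(1) %>%
  ungroup() %>%
  arrange(orig_row) %>%                # restore the original order explicitly
  select(-orig_row)

print(res)
```

The explicit `arrange()` makes the ordering part of the code, so the result does not change if the internal ordering of `slice()` does.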

@hadley hadley closed this as completed Feb 22, 2017
@lock lock bot locked as resolved and limited conversation to collaborators Jun 8, 2018

3 participants