Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Combination of group_by() and arrange() does not produce expected result? #491

Closed
hstojic opened this issue Jul 11, 2014 · 4 comments
Closed
Assignees
Labels
Milestone

Comments

@hstojic
Copy link

@hstojic hstojic commented Jul 11, 2014

Hi,

this comes from my post on Stackoverflow, Hadley asked me to file an issue here.

When using dplyr function group_by() and immediately afterwards arrange(), I would expect to get an output where data frame is ordered within groups that I stated in group_by(). My reading of documentation is that this combination should produce such a result, however when I tried it this is not what I get, and googling did not indicate that other people ran into the same issue. Am I wrong in expecting this result?

Here is an example, using the R built-in dataset ToothGrowth:

library(dplyr)
ToothGrowth %>%
  group_by(supp) %>%
  arrange(len)

Running this will produce a data frame where the whole data frame is ordered according to len and not within supp factors.

This is the code that produces the desired output:

ToothGrowth %>%
  group_by(supp) %>%
  do( data.frame(with(data=., .[order(len),] )) )
@hadley
Copy link
Member

@hadley hadley commented Jul 28, 2014

@romainfrancois Do you think this should be fixed in C++ or in R?

@romainfrancois
Copy link
Member

@romainfrancois romainfrancois commented Jul 28, 2014

I can see how we could take advantage of the available grouping information on the C++ side, i.e. we only have to sort within each group. This could be much cheaper, and this is an obvious place where we can parallelise.

Having said that, it is probably easier, if not trivial, to write the R version that just calls the available arrange with the correct list of variables.

@hadley hadley added the bug label Aug 1, 2014
@hadley hadley added this to the 0.3 milestone Aug 1, 2014
@hadley hadley self-assigned this Aug 1, 2014
@hadley
Copy link
Member

@hadley hadley commented Aug 1, 2014

@romainfrancois it seems like arrange is currently doing something with grouping. If I write:

arrange.tbl_df    <- function (.data, ...) {
  arrange_impl(.data, c(groups(.data), dots(...)), environment())
}

then these two things should return the same result, but don't

  df1 <- mtcars %>% group_by(cyl) %>% arrange(disp) %>% ungroup()
  df2 <- mtcars %>% arrange(cyl, disp)

Looking at the C++ code, I think you'll need to handle this. For now, I wouldn't worry about doing this efficiently (by taking advantage of the indices), I think it's probably ok to sort like it's an ungrouped df.

@hadley hadley assigned romainfrancois and unassigned hadley Aug 1, 2014
@hadley
Copy link
Member

@hadley hadley commented Aug 1, 2014

Here's a test case:

test_that("grouped arrange sorts first by group", {
  df1 <- mtcars %>% group_by(cyl) %>% arrange(disp) %>% ungroup()
  df2 <- mtcars %>% arrange(cyl, disp)

  expect_equal(df1, df2)
})

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
3 participants