Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ignore grouping when arrange()ing #1206

Closed
ajdamico opened this issue Jun 9, 2015 · 26 comments
Closed

Ignore grouping when arrange()ing #1206

ajdamico opened this issue Jun 9, 2015 · 26 comments
Labels
Milestone

Comments

@ajdamico
Copy link
Contributor

@ajdamico ajdamico commented Jun 9, 2015

set.seed( 2014 )

library(dplyr)
library(nycflights13)


x <- flights
x[ sample( seq( nrow( x ) ) , 300000 ) , 'origin' ] <- NA
x[ sample( seq( nrow( x ) ) , 300000 ) , 'dest' ] <- NA


# arrange( desc( num ) ) fails

x %>%
    group_by( dest , origin ) %>%
        summarize( 
            num = n() ,  
            something = n_distinct( arr_time ) ,
            another = round( 100 * mean( is.na( arr_time ) ) , 1 ) ,
            yet = round( 100 *mean( arr_time , na.rm = TRUE ) , 1 )
        ) %>%
            arrange( desc( num ) )
@jmarywien
Copy link

@jmarywien jmarywien commented Jul 2, 2015

I think what you are expecting is for num to be arranged in descending order, regardless of flight destination and origin.

However, group_by returns a grouped_df object, so any subsequent arranges will be applied within group according to https://github.com/hadley/dplyr/blob/master/src/arrange.cpp#L13.

If you want to arrange in descending order, regardless of flight and origin, you must cast the grouped_df to data_frame using

    x %>% data.frame %>% arrange(desc(num))

@ajdamico
Copy link
Contributor Author

@ajdamico ajdamico commented Jul 2, 2015

you're absolutely right, thanks! and shame on me for missing this sentence from http://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html

grouped arrange() orders first by grouping variables

i've proposed a change to the ?arrange help page.

master...ajdamico:patch-1

@ajdamico ajdamico closed this Jul 2, 2015
@hadley hadley reopened this Jul 4, 2015
@hadley
Copy link
Member

@hadley hadley commented Jul 4, 2015

I'm reopening this because I think it's rather confusing, and I'm considering changing back to the old behaviour where arrange ignores groups.

@jmarywien
Copy link

@jmarywien jmarywien commented Jul 4, 2015

I think that's a good idea Hadley. Have a good weekend.
On Sat, Jul 4, 2015 at 2:48 AM Hadley Wickham notifications@github.com
wrote:

I'm reopening this because I think it's rather confusing, and I'm
considering changing back to the old behaviour where arrange ignores groups.


Reply to this email directly or view it on GitHub
#1206 (comment).

@stevencb
Copy link

@stevencb stevencb commented Jul 4, 2015

To be honest, I think it makes sense that arrange would sort within groups (i.e. I don't think it should be changed back). I seems to make sense that if you apply something like arrange to a grouped data frame, that it acts within groups, not globally. One can of course ungroup to get the other behavior. I actually find the current behavior more intuitive.

If you do change it back, can you provide an option to arrange just within groups ? I have found this useful.

@stevencb
Copy link

@stevencb stevencb commented Jul 4, 2015

FYI,
An alternative to

x %>% data.frame %>% arrange(desc(num))

is

x %>% ungroup %>% arrange(desc(num))

Using "ungroup" will maintain the class types of x. So if x is a tbl_df, it will still be after ungroup but not after piping to data.frame.

@hadley
Copy link
Member

@hadley hadley commented Jul 4, 2015

It's not more intuitive to most people

@kevinushey
Copy link
Contributor

@kevinushey kevinushey commented Jul 5, 2015

IMO there should be separate functions for sorting within groups vs sorting without groups.

@nicholasjhorton
Copy link
Contributor

@nicholasjhorton nicholasjhorton commented Jul 5, 2015

Would a separate option (defaulting to sorting without groups) be feasibe to consider?

Nick

On Jul 4, 2015, at 8:37 PM, Kevin Ushey notifications@github.com wrote:

IMO there should be separate functions for sorting within groups vs sorting without groups.


Reply to this email directly or view it on GitHub.

Nicholas Horton
Professor of Statistics
Department of Mathematics and Statistics, Amherst College
Box 2239, 31 Quadrangle Dr
Amherst, MA 01002-5000
https://www.amherst.edu/people/facstaff/nhorton

@ajdamico
Copy link
Contributor Author

@ajdamico ajdamico commented Jul 5, 2015

perhaps something like arrange and arrange_by -- arrange could be equivalent to x %>% ungroup %>% arrange_by and then arrange_by behaves groupwise like the current arrange function does. since both could be on the same help page, typing ?arrange would make it immediately obvious if a user wants the _by

@hadley
Copy link
Member

@hadley hadley commented Jul 5, 2015

I don't think adding more options or function variants will help here.

@AMarthaller
Copy link

@AMarthaller AMarthaller commented Jul 31, 2015

This bit me today. How about just preserving an arrange if it's called before group_by? Then you can support both with the same existing method. Order of operations would control the behavior.

@AMarthaller
Copy link

@AMarthaller AMarthaller commented Aug 4, 2015

Of course my previous suggestion would not help with the ordering of newly created variables. It is also important to maintain an option for sorting within groups as row_number() queries often leverage specific ordering.

@hadley hadley added the feature label Aug 24, 2015
@hadley hadley added this to the 0.5 milestone Aug 24, 2015
@hadley hadley changed the title example where arrange( desc( ) ) fails Ignore grouping when arrange()ing Aug 24, 2015
@hadley hadley closed this in 726b2a6 Mar 7, 2016
@coolbutuseless
Copy link
Contributor

@coolbutuseless coolbutuseless commented Mar 25, 2016

I'm sorry I found this issue too late! This is a definite "-1" from me, but it's probably too late for the pebbles to vote...

If I have a grouped data.frame, and want to arrange(), I would expect it to arrange within the groups. If I didn't want that behaviour, and wanted global sorting, I would use an 'ungroup()' first.

So in the current CRAN dplyr, I can access both types of sorting - "within groups" and "global".

With this new change, what's the solution now for when I want grouped data to be arranged within groups? There doesn't appear to be any way to access this old behaviour (other than to have to specifiy (again!) the grouping variables as part of the arrange())...

@onurfiliz
Copy link

@onurfiliz onurfiliz commented Mar 31, 2016

I'm also in the camp that is fond of the previous behavior. In any case, this change seems to result in different behavior for local vs sql sources:

library('dplyr')
t <- src_postgres('test') %>% copy_to(mtcars)
f <- function(t) {
   t %>% 
     group_by(cyl) %>% 
     arrange(mpg) %>% 
     mutate(r=as.numeric(row_number())) %>% 
     select(cyl, mpg, r) %>%
     as.data.frame
}
remote <- f(t)
local <- f(t %>% collect(n=Inf))
identical(local, remote)
cbind(local, remote)

wherein postgres still orders by c('cyl', 'mpg') and tbl_df orders by 'mpg' only. I'm not sure how one would get the older behavior from postgres if remote sources are modified to match the new one.

@AMarthaller from what I can see in the output above, the behavior with window functions like row_number is still doing what it's expected to. It's just the output ordering that's different.

Also note that for local sources it is possible to get the old behavior with do:

mtcars %>% group_by(cyl) %>% do({ tbl_df(.) %>% arrange(mpg) %>% mutate(r=as.numeric(row_number())) })

@ghost
Copy link

@ghost ghost commented Apr 9, 2016

IMHO this was the wrong decision and should be reversed. Having arrange ignore grouping is inconsistent with the behavior of the all-important verbs mutate() and summarise(). It breaks a considerable amount of code that I and people on my team have written. Maybe it fixes the complaints of users who didn't understand/appreciate the old API, but it breaks the code of the many users who did.

At the very least, there needs to be an easy way to arrange within groups, whether it's a different function or an option or whatever.

@hadley
Copy link
Member

@hadley hadley commented Apr 9, 2016

I'm not discussing it further, sorry

@instantkaffee
Copy link

@instantkaffee instantkaffee commented Jun 24, 2016

Thats really a pitty. We have production code thats going to break. Not good.

@onurfiliz
Copy link

@onurfiliz onurfiliz commented Jun 25, 2016

FWIW, you can get the old behavior by prefixing the columns in the arrange() call with the ones in the group_by() call, e.g.

mtcars %>% group_by(cyl) %>% arrange(mpg)

vs

mtcars %>% group_by(cyl) %>% arrange(cyl, mpg)

@MilesMcBain
Copy link

@MilesMcBain MilesMcBain commented Jun 28, 2016

I'm not a fan of this decision due to the fact that it breaks the 'groups are persistent and affect everything' idiom. That's the real unintuitive part when you're moving to dplyr from say SQL. Once you adjust your thinking though, it's fine.

HOWEVER, I think the impact on production code is being overstated. Order dependent code using any of the order dependent dplyr functions (e.g. first(), last(), lead(), lag(), row_number() etc) still works within arranged groups as before. I can't actually imagine where this might cause issues other than in the presentation, which you can fix by adding additional parameters to arrange() as @onurfiliz has said.

@sillasgonzaga
Copy link

@sillasgonzaga sillasgonzaga commented Jun 28, 2016

@onurfiliz that's a great idea. See, people, we do not have to come back to the previous code to use arrange within a group.

@Ax3man
Copy link

@Ax3man Ax3man commented Jul 1, 2016

I don't think this is much of an issue, really. But for those who are a little stubborn and/or lazy, this function should produce the old behavior:

arrange_old <- function(.data, ...) {
  dplyr::arrange_(.data, .dots = c(groups(.data), lazyeval::lazy_dots(...)))
}

@piccolbo
Copy link

@piccolbo piccolbo commented Jul 11, 2016

To see why this is wrong, consider the following examples (I know Hadley has vowed to ignore the rest of the conversation. This is on principle. He's not going to change it back in dplyr 0.7. and claim Chambers Himself appeared in his dreams. No way.)

mtcars %>% arrange(mpg)  %>% mutate(min_rank(mpg))

Now visually check it's doing what is expected. Be sure. Now try this

mtcars %>% arrange(mpg)  %>% group_by(carb) %>% mutate(min_rank(mpg)) 

Add a %>% as.data.frame to see all the rows. Check that it's doing what it's supposed to do. The groups are all mixed up in the output, so you can't (well, I can't without a much larger effort). Now the workaround.

mtcars %>% arrange(mpg, carb)  %>% group_by(carb) %>% mutate(min_rank(mpg))

Check now. Isn't it much easier? By the way, this is the same output as postgres with or without the workaround. Yes, this change makes data.frame and postgres srcs behave differently, as long as order of output matters. I did not work hard to create this example. In fact, it's implicit in @MilesMcBain comment. I believe you can replace any other window function for min_rank and the example still works. Moreover, I could not build a single example to support this change. If a data set is grouped according to its metadata, it behaves as grouped for all purposes, it should be printed grouped. It doesn't seem that controversial to me. Not even interesting.

By they way @Ax3man clever solution doesn't work in this case because I put the arrange before the group. The way I understand dplyr sql translation, the order should commute (I am not fully in the know on 0.5. which has a new translation system) . Which means the right point of action is group_by, actually the non-NSE group_by_

group_by_ = 
  function(.data, ..., .dots, add = FALSE) 
    arrange_(dplyr::group_by_(.data, ..., .dots = .dots, add = add), ..., .dots = .dots)

group_by = 
  function(.data, ..., add = FALSE) 
    group_by_(.data, .dots = lazyeval::lazy_dots(...), add = add)

This seems to work in limited testing. One probably could be more surgical and fix only the data frame method with some namespace sorcery.

@Ax3man
Copy link

@Ax3man Ax3man commented Jul 12, 2016

@piccolbo, I guess whether grouped_dfs should be arranged by group by default is a separate issue, no? I don't think that previous versions did that either. Personally, I would expect that my rows stay where I left them, unless I explicitly call arrange. And if I made a point of arranging before doing any grouping, I would certainly expect that my result is not arranged by group. Perhaps I misunderstand your point.

Anyway, I think my function successfully produces the old behavior, but perhaps not your desired behavior.

(And I guess in your third line of code it should read arrange(carb, mpg)?)

@antoine-lizee
Copy link

@antoine-lizee antoine-lizee commented Aug 15, 2016

To me too, this was the wrong decision.

@hadley & co, you are rightfully redesigning the R language with your many packages (thanks again!), but tipping the balance towards 'easy' against 'consistent' is what will eventually jeopardize the new language you are creating.

Having the programmer adding an ungroup call here was the straighforward solution, while having exceptional behavior goes in the way of discoverability and eventually productivity. Given the rest of the dplyr ecosystem, the current behavior of arrange() is a bug - however 'easier' it makes one particular line of code.

I agree that this is per se not a big deal, because not much code should actually rely on arrrange() within group_bys. It's nevertheless a good occasion to discuss:

  • what is a right line between (easy to type) vs (consistent)
  • how we could develop simple processes to have hadley's ecosystem (of programmers) discuss such design choices before it gets coded and published.

Here, it seems that many more are in favor of the old behavior, and @hadley might have reached a different decision had this discussion happened before. Highlighting a couple of major decisions that are breaking changes in the dev section of the NEWS file might be enough ? (on us then to check it often)

By the way @Ax3man, calling people who rely on R in production 'lazy' is quite misguided to me 😄 ('stubborn' might be more accurate though)

@tidyverse tidyverse locked and limited conversation to collaborators Aug 15, 2016
@hadley
Copy link
Member

@hadley hadley commented Aug 15, 2016

As I have said there is no point discussing this further as further changes just make life worse for everyone.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet