preserve order of original dataset when using group_by()? #2159

jwhendy · 2016-10-02T17:49:55Z

Feel free to let me know that this is not a bug. I was just puzzled by the re-ordering of my grouped column. I specifically converted to character in my input df in order to not have to constantly re-order the columns for plotting in ggplot2. This isn't the original data, but here's a reproducible example:

library(dplyr)
set.seed(1)
dat <- data.frame(lett = rep(letters[3:1], each = 3), vals = rnorm(9, 5, 2))

## get rid of default factor application to character col
dat$lett <- as.character(dat$lett)

## get means by lett column
dat_means <- dat %>% group_by(lett) %>% summarise(ave = mean(vals))
dat_means

Result, even without factors:

# A tibble: 3 × 2
   lett      ave
  <chr>    <dbl>
1     a 6.201023
2     b 5.736213
3     c 4.147707

My actual data contains month names, so consider that "c", "b", "a" is already the correct monthly order. What I expected/hoped for was that the groups could be fed into summarise() as they appear in the data frame. Otherwise, to get it to play nicely with ggplot2 I've just been going back to factors since I can easily specify an order. This:

lett_orders <- unique(dat$lett)

dat_means$lett <- factor(dat_means$lett, levels = lett_orders)

It would be nice not to re-specify the ordering I already have in the data source to the results of dplyr's group_by() whenever I use it.

Is there a de facto way to not have to do this? Or what is the suggested route to get back to the ordering that appears in the original data set for, say, plotting discrete variables in the correct order?


krlmlr · 2016-11-07T16:36:56Z

Thanks, we should preserve order when grouping. Would you like to contribute a testthat test for the desired behavior?

jwhendy · 2016-11-10T16:38:28Z

@krlmlr I'll have to look into that package; I've never done something like that before. Would wrapping the above into a test script work?

If this is meant to serve as a dplyr official test, I might expand it a bit to see about the behavior of multiple columns and grouping by various of them. I haven't fully explored the sorting nature of dplyr with a bunch of columns/grouping options.

Edit: I also don't know what is intended from dplyr. Part of posting here was that it's conceivable that one would want sorted groups, so at least half of this report was meant as a sanity check.

You're confirming that dplyr should leave your data alone and run through grouped variables in the order they appear in the data set (position of first unique instance)?
Similarly, is the intention the same for factors, or should group_by() respect (as in output in the order of) factor levels?

krlmlr · 2016-11-10T17:09:04Z

I may have misread your post. I'm not sure we want to maintain the original order when grouping after all; #2192 is related. To solve your problem, you could turn your character column into a factor with the levels in the order you need for plotting. If the order of the values in the data frame is correct, you can use forcats::fct_inorder().
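A minimal sketch of the fct_inorder() suggestion, reusing the toy data from the original report (the letters stand in for month names). It assumes the forcats package is installed; the pipeline itself is only an illustration, not an official recommendation.

```r
library(dplyr)
library(forcats)

# Same toy data as in the original report.
set.seed(1)
dat <- data.frame(
  lett = rep(letters[3:1], each = 3),
  vals = rnorm(9, 5, 2),
  stringsAsFactors = FALSE
)

dat_means <- dat %>%
  mutate(lett = fct_inorder(lett)) %>%  # factor levels follow first appearance
  group_by(lett) %>%
  summarise(ave = mean(vals))

levels(dat_means$lett)
# "c" "b" "a"
```

Because the column is now a factor, the summary rows and any ggplot2 discrete axis follow the level order c, b, a rather than alphabetical order.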

jwhendy · 2016-11-11T01:56:07Z

True, as cited above I can also go to factors after group_by(), and just checked to verify I could do it before as well. I took that other post to mean Hadley thought the original order should be preserved.

If the group_by() vector is a character, is there a preference toward alphabetical order vs. order of appearance? If so, or if there's another method, at the very least I think dplyr could add a note to this effect somewhere easily stumbled upon. "If the grouping variable is a factor, the level order will be preserved; otherwise the output will order the results by grouping variables in alphabetical order [or fill-in-blank method]."

For whatever reason, in my mind I sort of imagine it grouping by order of appearance (first variable is c, so it goes through, finds the rest, calculates the function, and stores it, then moves onto the next unique occurrence, groups those, etc.). Not saying this is right by any means, but technically I didn't ask for a sort, I just asked for grouping.
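If the column should stay a character vector, one way to get the order-of-appearance result described here is to reorder the summary afterwards. This is a sketch under the assumption that summarise() returns the groups sorted; `first_seen` is a helper name introduced only for this example.

```r
library(dplyr)

# Small deterministic example: groups appear in the order c, b, a.
dat <- data.frame(
  lett = rep(c("c", "b", "a"), each = 3),
  vals = 1:9,
  stringsAsFactors = FALSE
)

# Order in which each group first appears in the data (hypothetical helper).
first_seen <- unique(dat$lett)

dat_means <- dat %>%
  group_by(lett) %>%
  summarise(ave = mean(vals)) %>%
  arrange(match(lett, first_seen))  # put rows back in order of appearance

dat_means$lett
# "c" "b" "a"
```

This keeps `lett` as character, so a ggplot2 axis would still need a factor (or explicit limits) to plot in that order.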

krlmlr · 2016-11-11T06:28:58Z

You can also go to factors before grouping:

tibble(a = factor(letters, levels = rev(letters))) %>% count(a)

@hadley: Is there a specific reason why group_by() also sorts?

hadley · 2016-11-11T14:00:54Z

group_by() doesn't sort, but summarise() does. It's easier than deriving the order from the data, which might not be sorted either.

krlmlr · 2016-11-11T14:23:24Z

group_by() for data frames computes group indices, which are then used by summarise(). Currently they are returned in sorted order, not in order of appearance. I guess sorting actually requires more work than just picking up the items in order.
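The difference between the two kinds of group indices can be illustrated in base R. This is only an analogy for the distinction being discussed, not dplyr's internal implementation.

```r
x <- c("c", "c", "b", "a", "b")

# Group ids in order of appearance: the first distinct value seen gets id 1.
ids_appearance <- match(x, unique(x))
ids_appearance
# 1 1 2 3 2

# Group ids in sorted order: the distinct values are sorted before numbering.
ids_sorted <- match(x, sort(unique(x)))
ids_sorted
# 3 3 2 1 2
```

Both vectors describe the same partition of the rows; only the numbering of the groups, and therefore the order of a per-group summary, differs.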

ghaarsma · 2018-01-02T01:45:33Z

It seems that dplyr's group_by does sort, at least for character, integer and numeric. It does maintain order for factor. Tested with dplyr 0.7.4:

library(dplyr)
set.seed(4)
char   <- sample(LETTERS[1:20],40,replace = TRUE)
int    <- sample(1L:20L,40,replace = TRUE)
double <- sample(runif(20),40,replace = TRUE)

x <- tibble(char,int,double,fact=factor(char,levels = unique(char)))

# All group_by results are sorted except the factor
group_by(x,char) %>% do(.[1,'char'])
group_by(x,int) %>% do(.[1,'int'])
group_by(x,double) %>% do(.[1,'double'])
group_by(x,fact) %>% do(.[1,'fact'])

# If group_by does not sort, the first indices should contain the first element (zero-based)
# This is only true for the factor
g <- group_by(x,char);attr(g,'indices')[[1]]
g <- group_by(x,int);attr(g,'indices')[[1]]
g <- group_by(x,double);attr(g,'indices')[[1]]
g <- group_by(x,fact);attr(g,'indices')[[1]]

Not sure why group_by is sorting. The sort seems unnecessary and adds computational effort. Dropping it would make the behavior more like the base function unique or the dplyr function distinct, neither of which sorts.

Sometimes sorting is nice, so perhaps it could be an option. If the behavior remains as is, perhaps we can add a sorting note to the group_by documentation.
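For reference, the comparison with unique and distinct can be checked on a tiny example. The tibble and its column name are made up for illustration.

```r
library(dplyr)

x <- tibble(char = c("c", "b", "a", "c", "b"))

# Base R: distinct values in order of first appearance.
unique(x$char)
# "c" "b" "a"

# dplyr: distinct rows, also in order of first appearance.
distinct(x, char)$char
# "c" "b" "a"
```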

krlmlr · 2018-01-02T09:15:48Z

Thanks. Would you mind filing a new issue? This comment is likely to get lost otherwise. Please make sure to add references to existing discussions.

If you argue with computational effort, we'd need to assess the actual impact on run time. The new gprofiler package is an option (work in progress, documentation coming very soon, currently Linux only).
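As a rough first look at the run-time question, the two ways of numbering groups can be timed in base R with system.time(). This is a stand-in sketch: it does not use gprofiler, it does not measure dplyr's own grouping code, and the key format and sizes are arbitrary choices made for this example.

```r
set.seed(4)

# One million values drawn from 100,000 distinct keys (arbitrary sizes).
keys <- sprintf("key%06d", 1:100000)
x <- sample(keys, 1e6, replace = TRUE)

# Group ids in order of appearance.
t_appearance <- system.time(ids_a <- match(x, unique(x)))

# Group ids after sorting the distinct values.
t_sorted <- system.time(ids_s <- match(x, sort(unique(x))))

# Elapsed seconds for each variant; results will vary by machine.
rbind(appearance = t_appearance, sorted = t_sorted)[, "elapsed"]
```

Any real assessment would have to profile group_by() and summarise() themselves on representative data.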

jwhendy changed the title ~~preserve order of original data.frame using group_by()~~ preserve order of original dataset when using group_by()? Oct 2, 2016

jwhendy added the data frame label Nov 7, 2016

hadley closed this as completed Feb 16, 2017

ghaarsma mentioned this issue Jan 2, 2018

group_by is sorting and does not maintain original order #3279

Closed

lock bot locked as resolved and limited conversation to collaborators Jul 1, 2018
