preserve order of original dataset when using group_by()? #2159

Closed
jwhendy opened this issue Oct 2, 2016 · 9 comments

@jwhendy

jwhendy commented Oct 2, 2016

Feel free to let me know that this is not a bug. I was just puzzled by the re-ordering of my grouped column. I specifically converted to character in my input df in order to not have to constantly re-order the columns for plotting in ggplot2. This isn't the original data, but here's a reproducible example:

library(dplyr)
set.seed(1)
dat <- data.frame(lett = rep(letters[3:1], each = 3), vals = rnorm(9, 5, 2))

## get rid of default factor application to character col
dat$lett <- as.character(dat$lett)

## get means by lett column
dat_means <- dat %>% group_by(lett) %>% summarise(ave = mean(vals))
dat_means

Result, even without factors:

# A tibble: 3 × 2
   lett      ave
  <chr>    <dbl>
1     a 6.201023
2     b 5.736213
3     c 4.147707

My actual data contains month names, so consider that "c", "b", "a" is already the correct monthly order. What I expected/hoped for was that the groups would be fed into summarise() in the order they appear in the data frame. Otherwise, to get it to play nicely with ggplot2, I've just been going back to factors since I can easily specify an order, like this:

lett_orders <- unique(dat$lett)

dat_means$lett <- factor(dat_means$lett, levels = lett_orders)

It would be nice not to have to re-apply the ordering I already have in the data source to the results of dplyr's group_by() every time I use it.

Is there a de facto way to avoid this? Or what is the suggested route to get back to the ordering that appears in the original data set for, say, plotting discrete variables in the correct order?

@jwhendy jwhendy changed the title from "preserve order of original data.frame using group_by()" to "preserve order of original dataset when using group_by()?" Oct 2, 2016
@krlmlr
Member

krlmlr commented Nov 7, 2016

Thanks, we should preserve order when grouping. Would you like to contribute a testthat test for the desired behavior?

@jwhendy
Author

jwhendy commented Nov 10, 2016

@krlmlr I'll have to look into that package; I've never done something like that before. Would wrapping the above into a test script work?

If this is meant to serve as an official dplyr test, I might expand it a bit to cover the behavior with multiple columns and grouping by various combinations of them. I haven't fully explored how dplyr sorts with a bunch of columns/grouping options.

Edit: I also don't know what is intended from dplyr. Part of posting here was that it's conceivable that one would want sorted groups, so at least half of this report was meant as a sanity check.

  • You're confirming that dplyr should leave your data alone and run through grouped variables in the order they appear in the data set (position of first unique instance)?
  • Similarly, is the intention the same for factors, or should group_by() respect (as in output in the order of) factor levels?

@krlmlr
Member

krlmlr commented Nov 10, 2016

I may have misread your post. I'm not sure if we want to maintain the original order when grouping after all; #2192 is related. To solve your problem, you could turn your character column into a factor with the levels in the order you need for plotting. If the order of the values in the data frame is already correct, you can use forcats::fct_inorder().
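A minimal sketch of the fct_inorder() route, assuming the forcats package is installed. The data frame mirrors the one in the original report, but the `vals` column here is made-up illustrative data (1:9) so the result is easy to check by hand:

```r
library(dplyr)
library(forcats)

# Illustrative data: same layout as the report, with hypothetical values.
# stringsAsFactors = FALSE keeps `lett` as character on any R version.
dat <- data.frame(
  lett = rep(letters[3:1], each = 3),
  vals = 1:9,
  stringsAsFactors = FALSE
)

dat_means <- dat %>%
  mutate(lett = fct_inorder(lett)) %>%  # levels follow first appearance: c, b, a
  group_by(lett) %>%
  summarise(ave = mean(vals))

# The summary rows follow the factor levels rather than alphabetical order.
levels(dat_means$lett)
```

Because the grouping column is now a factor, ggplot2 should also pick up the same level order on a discrete axis without any re-leveling after the summary.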

@jwhendy
Author

jwhendy commented Nov 11, 2016

True, as cited above I can also go to factors after group_by(), and just checked to verify I could do it before as well. I took that other post to mean Hadley thought the original order should be preserved.

If the group_by() vector is a character, is there a preference toward alphabetical order vs. order of appearance? If so, or if there's another method, at the very least I think dplyr could add a note to this effect somewhere easily stumbled upon. "If the grouping variable is a factor, the level order will be preserved; otherwise the output will order the results by grouping variables in alphabetical order [or fill-in-blank method]."

For whatever reason, in my mind I sort of imagine it grouping by order of appearance (first variable is c, so it goes through, finds the rest, calculates the function, and stores it, then moves onto the next unique occurrence, groups those, etc.). Not saying this is right by any means, but technically I didn't ask for a sort, I just asked for grouping.

@krlmlr
Member

krlmlr commented Nov 11, 2016

You can also go to factors before grouping:

tibble(a = factor(letters, levels = rev(letters))) %>% count(a)

@hadley: Is there a specific reason why group_by() also sorts?

@hadley
Member

hadley commented Nov 11, 2016

group_by() doesn't sort, but summarise() does. It's easier than deriving the order from the data, which might not be sorted either.

@krlmlr
Member

krlmlr commented Nov 11, 2016

group_by() for data frames computes group indices, which are then used by summarise(). Currently they are returned in sorted order, not in order of appearance. I guess sorting actually requires more work than just picking up the items in order.
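Since the groups come back sorted, one way to get order of first appearance without converting to a factor is to reorder the summary afterwards. This is only a sketch of a workaround, not something dplyr does internally; `first_seen` is a helper name introduced here, and the `vals` column is made-up illustrative data:

```r
library(dplyr)

# Illustrative data: same layout as the original report, hypothetical values.
dat <- data.frame(
  lett = rep(letters[3:1], each = 3),
  vals = 1:9,
  stringsAsFactors = FALSE
)

# Group keys in the order they first appear in the data.
first_seen <- unique(dat$lett)

dat_means <- dat %>%
  group_by(lett) %>%
  summarise(ave = mean(vals)) %>%
  # match() gives each key's position in first_seen; arrange() sorts by it.
  arrange(match(lett, first_seen))

dat_means$lett
```

The column stays character here, so plotting code that relies on a specific discrete order would still need a factor.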

@hadley hadley closed this as completed Feb 16, 2017
@ghaarsma

ghaarsma commented Jan 2, 2018

It seems that dplyr's group_by does sort, at least for character, integer and numeric. It does maintain order for factor. Tested with dplyr 0.7.4:

library(dplyr)
set.seed(4)
char   <- sample(LETTERS[1:20],40,replace = TRUE)
int    <- sample(1L:20L,40,replace = TRUE)
double <- sample(runif(20),40,replace = TRUE)

x <- tibble(char,int,double,fact=factor(char,levels = unique(char)))

# All group_by results are sorted except the factor
group_by(x,char) %>% do(.[1,'char'])
group_by(x,int) %>% do(.[1,'int'])
group_by(x,double) %>% do(.[1,'double'])
group_by(x,fact) %>% do(.[1,'fact'])

# If group_by does not sort, the first indices should contain the first element (zero-based)
# This is only true for the factor
g <- group_by(x,char);attr(g,'indices')[[1]]
g <- group_by(x,int);attr(g,'indices')[[1]]
g <- group_by(x,double);attr(g,'indices')[[1]]
g <- group_by(x,fact);attr(g,'indices')[[1]]

Not sure why group_by is sorting. It seems unnecessary, and it adds computational effort. Not sorting would make the behavior more like the base function unique or the dplyr function distinct, neither of which sorts.

Sometimes sorting is nice, so perhaps it could be an option. If the behavior remains as is, perhaps we can add a sorting note to the group_by documentation.
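A small sketch of the comparison with unique and distinct, using a made-up five-row tibble (the variable names `by_unique`, `by_distinct`, and `by_group` are introduced here for illustration):

```r
library(dplyr)

# Hypothetical data: "C" appears first, then "A", then "B".
x <- tibble(char = c("C", "A", "C", "B", "A"))

# Both keep keys in order of first appearance.
by_unique   <- unique(x$char)
by_distinct <- distinct(x, char)$char

# Grouping on a character column and summarising returns sorted keys.
by_group <- x %>%
  group_by(char) %>%
  summarise(n = n())

by_unique      # "C" "A" "B"
by_distinct    # "C" "A" "B"
by_group$char  # "A" "B" "C"
```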

@krlmlr
Member

krlmlr commented Jan 2, 2018

Thanks. Would you mind filing a new issue? This comment is likely to get lost otherwise. Please make sure to add references to existing discussions.

If the argument is computational effort, we'd need to assess the actual impact on run time. The new gprofiler package is an option (work in progress, documentation coming very soon, currently Linux only).

@lock lock bot locked as resolved and limited conversation to collaborators Jul 1, 2018