Allow ungroup to specify removal of grouping variable #3760

Closed
ggrothendieck opened this issue Aug 21, 2018 · 10 comments · Fixed by #4671
Labels
feature a feature request or enhancement grouping 👨‍👩‍👧‍👦 wip work in progress

Comments

@ggrothendieck

ggrothendieck commented Aug 21, 2018

A common case is that one constructs a grouping variable in group_by but needs it only for the duration of the grouped operation, so afterwards one must use select to get rid of it, as in the example below. It would be pleasingly symmetric if ungroup could remove the added column just as group_by adds it, so that

ungroup(-g)

would be the same as

ungroup %>%
select(-g)

Thus in this example taken from https://stackoverflow.com/questions/51939874/referencing-previous-column-value-as-column-is-created/51940343#51940343

library(dplyr)

test <- structure(list(i = c(0, 1, 2, 3, 4, 0, 1, 2, 3, 4), chng = c(0, 
0.031, 0.005, -0.005, 0.017, 0, 0.012, 0.003, -0.013, -0.005), 
    indx = c(1, 1.031, 1.037, 1.031, 1.048, 1, 1.012, 1.015, 
    1.002, 0.997)), class = "data.frame", row.names = c(NA, -10L
))

test %>%
  group_by(g = cumsum(i == 0)) %>%
  mutate(indx = cumprod(chng + 1)) %>%
  ungroup %>%
  select(-g)

we could use one fewer statement: the last two lines of code above are combined into the last line below.

test %>%
  group_by(g = cumsum(i == 0)) %>%
  mutate(indx = cumprod(chng + 1)) %>%
  ungroup(-g)

Note the reduced line count and improved symmetry.
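
Until something like this exists, the two-step idiom can be wrapped in a small helper. The following is a minimal sketch, not dplyr API: ungroup_and_drop() is a hypothetical name, it simply forwards its arguments to select() after ungrouping, and the data is a shortened version of the test data above.

```r
library(dplyr)

# Hypothetical helper (not part of dplyr): ungroup, then apply a select() spec.
ungroup_and_drop <- function(.data, ...) {
  .data %>%
    ungroup() %>%
    select(...)
}

# Shortened version of the test data from the example above.
test <- data.frame(
  i    = c(0, 1, 2, 0, 1, 2),
  chng = c(0, 0.031, 0.005, 0, 0.012, 0.003)
)

result <- test %>%
  group_by(g = cumsum(i == 0)) %>%
  mutate(indx = cumprod(chng + 1)) %>%
  ungroup_and_drop(-g)

names(result)
```

The helper keeps the pipeline one step shorter without changing what ungroup() itself does.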

@romainfrancois
Member

🤔 ungroup does have a ... argument, which it does not use:

> dplyr:::ungroup.grouped_df
function(x, ...) {
  ungroup_grouped_df(x)
}
<bytecode: 0x1026547e8>
<environment: namespace:dplyr>

but I'm not sure about having ungroup also perform selection

@mkoohafkan

mkoohafkan commented Oct 5, 2018

Seems to me that incorporating this kind of logic into #3721 would be the better solution for this use case.

I do think it would be neat if ungroup could selectively remove some groupings but not others, e.g.

mtcars %>% group_by(gear, carb, cyl) %>% ungroup(cyl)

would be equivalent to

mtcars %>% group_by(gear, carb, cyl) %>% group_by(gear, carb)

which is how I first interpreted the title of this issue.
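
The equivalence claimed above rests on the fact that a second group_by() call replaces the existing grouping unless told to add to it. A minimal sketch that checks this, and that also computes the remaining grouping variables programmatically with group_vars(). The names grouped, replaced, keep and regrouped are illustrative only, and across() with all_of() assumes dplyr 1.0.0 or later.

```r
library(dplyr)

grouped <- mtcars %>% group_by(gear, carb, cyl)

# A second group_by() replaces the grouping rather than extending it.
replaced <- grouped %>% group_by(gear, carb)
group_vars(replaced)

# The same grouping, computed by dropping "cyl" from the current grouping.
keep <- setdiff(group_vars(grouped), "cyl")
regrouped <- grouped %>%
  ungroup() %>%
  group_by(across(all_of(keep)))
group_vars(regrouped)
```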

@ggrothendieck
Author

Here is another example taken from https://stackoverflow.com/questions/52906985/merging-of-duplicate-rows-that-have-misspelled-variables/52907932#52907932

library(phonics)
library(dplyr)

# create test data
Lines <- "CAR MPG
Mazda 5
Mazzda 2
Mzda 1"
DF <- read.table(text = Lines, header = TRUE, as.is = TRUE, strip.white = TRUE)

# process
DF %>% 
  group_by(key = soundex(CAR)) %>%
  summarize(CAR = toString(CAR), MPG = sum(MPG)) %>%
  ungroup %>%
  select(-key)

With the feature under discussion this would simplify to the shorter and more symmetric:

DF %>% 
  group_by(key = soundex(CAR)) %>%
  summarize(CAR = toString(CAR), MPG = sum(MPG)) %>%
  ungroup(-key)

@ggrothendieck
Author

ggrothendieck commented Oct 21, 2018

@mkoohafkan, the way group_by currently works is that if you want to incrementally add a variable, you specify group_by(new_var, add = TRUE).

I suppose there is the question of whether add=TRUE means add the variable to the group_by or really means modify the group_by and replace it with a new group_by. In this latter case it would make sense to write group_by(-cyl, add = TRUE) to remove cyl from the group_by while leaving the other group_by variables in effect rather than using ungroup for that.

Another possibility is to use ungroup(cyl, subtract = TRUE) for that analogously to group_by(new_var, add = TRUE).

One other point is that I don't think incrementally adding and removing parts of a group_by is that frequently encountered, whereas I have repeatedly encountered the ungroup %>% select(-var) sequence.
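
For reference, incremental grouping as described above can be sketched as follows. In dplyr 1.0.0 and later the argument is spelled .add (the add spelling used in this thread was deprecated in that release), so the sketch assumes a recent dplyr.

```r
library(dplyr)

by_gear <- mtcars %>% group_by(gear)

# Without .add, the new grouping replaces the old one.
replaced <- by_gear %>% group_by(carb)
group_vars(replaced)

# With .add = TRUE, carb is appended to the existing grouping.
extended <- by_gear %>% group_by(carb, .add = TRUE)
group_vars(extended)
```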

@mkoohafkan

mkoohafkan commented Oct 30, 2018

@ggrothendieck I thought about this more and I agree with your statements that

  1. using e.g. ungroup(cyl) to drop the column cyl is symmetric and
  2. using group_by(-cyl) to remove a column from an existing grouping would be a bit confusing with the existing add argument. If the add argument to group_by had originally been named update this would be syntactically cleaner, e.g. group_by(cyl, update = TRUE) and group_by(-cyl, update = TRUE).

ungroup(..., subtract = TRUE) looks like a good idea at first but... what would ungroup(cyl, subtract = FALSE) mean?

@yutannihilation
Member

yutannihilation commented Oct 31, 2018

group_by() has mutate semantics, not select semantics (cf. https://dplyr.tidyverse.org/articles/dplyr.html#selecting-operations). I guess you already noticed this when you tried group_by(-cyl, add = TRUE) and saw -cyl became the grouping variable.

dplyr::group_by(mtcars, -cyl)
#> # A tibble: 32 x 12
#> # Groups:   -cyl [3]
#>      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb `-cyl`
#>    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl>
#>  1  21       6  160    110  3.9   2.62  16.5     0     1     4     4     -6
#>  2  21       6  160    110  3.9   2.88  17.0     0     1     4     4     -6
#>  3  22.8     4  108     93  3.85  2.32  18.6     1     1     4     1     -4
#>  4  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1     -6
#>  5  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2     -8
#>  6  18.1     6  225    105  2.76  3.46  20.2     1     0     3     1     -6
#>  7  14.3     8  360    245  3.21  3.57  15.8     0     0     3     4     -8
#>  8  24.4     4  147.    62  3.69  3.19  20       1     0     4     2     -4
#>  9  22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2     -4
#> 10  19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4     -6
#> # ... with 22 more rows

Created on 2018-10-31 by the reprex package (v0.2.1)

So, to me, ungroup() should have mutate semantics as well for consistency (though I don't know what it means to mutate when ungrouping...). A possible solution is to implement scoped variants for ungroup() (e.g. ungroup_at())?
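
The difference between the two semantics can be seen side by side. In this small sketch, select() reads -cyl as "every column except cyl", while group_by() evaluates -cyl as an expression and stores the result in a new column. The variable names selected and grouped are illustrative only.

```r
library(dplyr)

# Select semantics: -cyl names a column to exclude.
selected <- select(mtcars, -cyl)
ncol(selected)

# Mutate semantics: -cyl is computed and added as a new grouping column.
grouped <- group_by(mtcars, -cyl)
ncol(grouped)
group_vars(grouped)
```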

@ggrothendieck
Author

ggrothendieck commented Nov 10, 2018

Here is another case where this feature could be used taken from https://stackoverflow.com/questions/53240324/dplyr-collapse-tail-rows-into-larger-groups/53240699#53240699
In this case we are manufacturing a sort key in order to keep the table in its original sorted order.
With the feature under discussion, the select at the end of the code could be combined into the ungroup and so omitted.

Note how this keeps coming up again and again.

library(dplyr)

df <- tibble(a = as.factor(1:20), b = c(50, 20, 13, rep(2, 10), rep(1, 7)))
df %>% 
  group_by(sortkey = -b, a = paste0(if_else(b %in% 1:2, "grp", ""), b)) %>%
  summarize(b = sum(b)) %>%
  ungroup %>%
  select(-sortkey)

@maxmoro

maxmoro commented Dec 5, 2018

Having a selective ungroup is also very important when calculating percentages of subgroups.

library(dplyr)
library(tidyr)

mtcars %>%
  group_by(gear, carb, vs) %>%
  summarise(count = n()) %>%
  group_by(gear, carb) %>% #<< would be better to do ungroup(vs)
  mutate(perc = count / sum(count)) %>%
  ungroup() %>%
  spread(vs, perc, sep = '=')

    gear  carb count `vs=0` `vs=1`
   <dbl> <dbl> <int>  <dbl>  <dbl>
 1     3     1     3   NA      1  
 2     3     2     4    1     NA  
 3     3     3     3    1     NA  
 4     3     4     5    1     NA  
 5     4     1     4   NA      1  
 6     4     2     4   NA      1  
 7     4     4     2    0.5    0.5
 8     5     2     1    0.5    0.5
 9     5     4     1    1     NA  
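
One detail worth checking in the pipeline above: by default summarise() drops the last grouping variable when each group yields a single row, so the result is typically already grouped by gear and carb before the second group_by(). A minimal sketch that inspects the grouping at that point (recent dplyr versions may also print a message about the .groups argument; the names counts and shares are illustrative only):

```r
library(dplyr)

counts <- mtcars %>%
  group_by(gear, carb, vs) %>%
  summarise(count = n())

# Grouping that remains after summarise() has peeled off the last level.
group_vars(counts)

# Percentages within each gear/carb subgroup, then a full ungroup.
shares <- counts %>%
  mutate(perc = count / sum(count)) %>%
  ungroup()
```

An explicit group_by(gear, carb) or a selective ungroup still documents the intent more clearly than relying on this default.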

@hadley
Member

hadley commented Dec 10, 2019

I think it would be fine for ungroup() to have select semantics even while group_by() has action semantics. I'd suggest df %>% ungroup() would continue to work as usual, and df %>% ungroup(x) would remove x from the grouping variables, throwing an error if not currently grouped by x.
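
The behaviour suggested here can be approximated with existing verbs. The following is only a sketch under stated assumptions: ungroup_vars() is a hypothetical helper, not dplyr's implementation, and for brevity it takes a character vector rather than a select-style specification. across() and all_of() assume dplyr 1.0.0 or later.

```r
library(dplyr)

# Hypothetical helper illustrating the proposed semantics; not dplyr API.
ungroup_vars <- function(.data, vars = character()) {
  current <- group_vars(.data)

  # No variables given: behave like a plain ungroup().
  if (length(vars) == 0) {
    return(ungroup(.data))
  }

  # Error if asked to remove a variable that is not a grouping variable.
  not_grouped <- setdiff(vars, current)
  if (length(not_grouped) > 0) {
    stop("Not grouped by: ", paste(not_grouped, collapse = ", "))
  }

  # Nothing left to group by: return an ungrouped result.
  remaining <- setdiff(current, vars)
  if (length(remaining) == 0) {
    return(ungroup(.data))
  }

  # Regroup by whatever grouping variables are left.
  group_by(ungroup(.data), across(all_of(remaining)))
}

grouped <- mtcars %>% group_by(gear, carb, cyl)
group_vars(ungroup_vars(grouped, "cyl"))
```

Calling ungroup_vars(grouped, "mpg") stops with an error, since mpg is not one of the grouping variables.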

@lock

lock bot commented Jun 24, 2020

This old issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with reprex) and link to this issue. https://reprex.tidyverse.org/

@lock lock bot locked and limited conversation to collaborators Jun 24, 2020