Simplify summarise to collapse data #4232

davidsjoberg · 2019-03-01T09:39:15Z

Hi,

dplyr is awesome, thank you for that. An issue I have come across many times when reviewing and writing code is collapsing data with dplyr. Users often want to apply different summary functions (min, max, mean, first) to different columns. summarise_at works great for homogeneous tibbles where every column should be summarised with the same function. But since it replaces the original tibble with the summarised result, it can't be chained with another summarise_at call that uses a different aggregation function.

A common way to write this:

# Load latest version of dplyr
if (packageVersion("dplyr") < "0.8") {
  stop("Update dplyr")
} else {
  library(dplyr, quietly = TRUE, warn.conflicts = FALSE)
}
#> Warning: package 'dplyr' was built under R version 3.4.4

# How users often collapse with dplyr
mtcars %>% 
  group_by(cyl) %>% 
  summarise(
    wt   = sum(wt),
    qsec = sum(qsec),
    disp = first(disp),
    drat = first(drat),
    mpg  = min(mpg, na.rm = TRUE),
    hp   = min(hp, na.rm = TRUE),
    vs   = min(vs, na.rm = TRUE),
    am   = mean(am),
    gear = mean(gear),
    carb = mean(carb)
  )
#> # A tibble: 3 x 11
#>     cyl    wt  qsec  disp  drat   mpg    hp    vs    am  gear  carb
#>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1     4  25.1  211.   108  3.85  21.4    52     0 0.727  4.09  1.55
#> 2     6  21.8  126.   160  3.9   17.8   105     0 0.429  3.86  3.43
#> 3     8  56.0  235.   360  3.15  10.4   150     0 0.143  3.29  3.5

Created on 2019-03-01 by the reprex package (v0.2.1)

The main problem with the above approach is that it is tedious, especially if a tibble has more than 50 columns. It is also error-prone, since the user has to reference each column twice and repeat the function for every column. Basically, it is hard to write efficient and dynamic code when collapsing data.

A solution would be to let users specify chunks of columns that should be summarised with the same function (and keep original name). I tried to make a simple take on how this would work.

### Creating the functions chunk and summarise_by

# Load latest version of dplyr
if (packageVersion("dplyr") < "0.8") {
  stop("Update dplyr")
} else {
  library(dplyr, quietly = TRUE, warn.conflicts = FALSE)
}
#> Warning: package 'dplyr' was built under R version 3.4.4


# chunk
chunk <- function (.funs, .vars, ...) 
{
  syms <- syms(substring(as.character(.vars), 2))
  funs <- dplyr:::as_fun_list(.funs, enquo(.funs), rlang::caller_env(), ...)
  dplyr:::manip_apply_syms(funs, syms, NULL)
}

# summarise_by (chunks)
summarise_by <- function(.tbl, ...){
  dplyr:::summarise_impl(.tbl, c(...), environment(), rlang::caller_env())
}

### Apply solution
mtcars %>% 
  group_by(cyl) %>% 
  summarise_by(
     chunk(sum,            vars(wt, qsec)), 
     chunk(first,          vars(disp, drat)),
     chunk(min, na.rm = T, vars(mpg, hp, vs)),
     chunk(mean,           vars(am, gear, carb))
     )
#> # A tibble: 3 x 11
#>     cyl    wt  qsec  disp  drat   mpg    hp    vs    am  gear  carb
#>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1     4  25.1  211.   108  3.85  21.4    52     0 0.727  4.09  1.55
#> 2     6  21.8  126.   160  3.9   17.8   105     0 0.429  3.86  3.43
#> 3     8  56.0  235.   360  3.15  10.4   150     0 0.143  3.29  3.5

Created on 2019-03-01 by the reprex package (v0.2.1)

Two things I have not solved: incorporating tidyselect, and making this work without creating a new summarise_* function, which is probably possible. The preferred syntax would be:

# mtcars %>% 
#   group_by(cyl) %>% 
#   summarise(
#     wt = sum(wt),
#     chunk(first,          vars(disp, drat)),
#     chunk(min, na.rm = T, vars(matches("hp|vs")))
#   )

Is this an issue that is worth digging into?


romainfrancois · 2019-03-04T17:33:21Z

We are aware that this is a problem, and are running various experiments around it.

In 0.9.*, you'll be able to return data frames (with 1 row) from summarise expressions.
Your implementation is somewhat similar to what I've played with in the splice 📦:

library(dplyr, warn.conflicts = FALSE)
library(splice)

mtcars %>% 
  group_by(cyl) %>% 
  summarise(
    !!!at_(vars(wt, qsec), sum), 
    !!!at_(vars(disp, drat), first),
    !!!at_(vars(mpg, hp, vs), min, na.rm = TRUE),
    !!!at_(vars(am, gear, carb), mean)
  )
#> # A tibble: 3 x 11
#>     cyl    wt  qsec  disp  drat   mpg    hp    vs    am  gear  carb
#>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1     4  25.1  211.   108  3.85  21.4    52     0 0.727  4.09  1.55
#> 2     6  21.8  126.   160  3.9   17.8   105     0 0.429  3.86  3.43
#> 3     8  56.0  235.   360  3.15  10.4   150     0 0.143  3.29  3.5

Here the ... are used for extra function parameters, so the tidy selection has to be captured by vars().

Created on 2019-03-04 by the reprex package (v0.2.1.9000)

But I currently prefer the approach I'm taking in the dance 📦 💃 🕺, with the tango() + swing() combo:

library(purrr)
library(dplyr, warn.conflicts = FALSE)
library(dance)
mtcars %>% 
  group_by(cyl) %>% 
  tango(
    swing(sum, wt, qsec), 
    swing(first, disp, drat), 
    swing(~min(., na.rm = TRUE), mpg, hp, vs), 
    swing(mean, am, gear, carb)
  )
#> # A tibble: 3 x 11
#>     cyl    wt  qsec  disp  drat   mpg    hp    vs    am  gear  carb
#>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1     4  25.1  211.   108  3.85  21.4    52     0 0.727  4.09  1.55
#> 2     6  21.8  126.   160  3.9   17.8   105     0 0.429  3.86  3.43
#> 3     8  56.0  235.   360  3.15  10.4   150     0 0.143  3.29  3.5

The take is slightly different here: swing() takes only one function (or lambda), and the ... are used for the tidy selection, which here is just an enumeration of variables.
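
For readers landing here later, the first point above (summarise expressions that return one-row data frames) can be sketched as follows. This assumes dplyr >= 1.0.0, where across() is available. It is not code from this thread, and the column groupings simply mirror the examples above.

```r
# Sketch only: assumes dplyr >= 1.0.0 (across() does not exist in 0.8.x)
library(dplyr, warn.conflicts = FALSE)

result <- mtcars %>%
  group_by(cyl) %>%
  summarise(
    across(c(wt, qsec), sum),                         # one function per chunk of columns
    across(c(disp, drat), first),
    across(c(mpg, hp, vs), ~ min(.x, na.rm = TRUE)),  # lambda carries the extra argument
    across(c(am, gear, carb), mean)
  )

result
```

Each across() call yields a data frame with one row per group, and summarise() unpacks its columns under their original names, so no splicing with !!! is needed.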

davidsjoberg · 2019-03-24T18:03:55Z

Thanks for your answer. Both approaches are really good! The nice thing about dance is that no !!! are needed. But splice is used in conjunction with summarise, which I find intuitive.

It took me some time to understand how splice found its way through the environment to get tidyselect to work. But I finally managed to work around the !!! syntax of splice while having a syntax that is more similar to dance. Normal summarise syntax also works if users prefer to pass named lists as usual.

# Functions
library(tidyverse)

summarise2 <- function (.data, ...) {
  dots <- quos(...)
  dots <- unlist(map_if(dots, names(dots) == "", ~rlang::eval_tidy(.)))
  dplyr:::summarise_impl(.data, dots, environment(), rlang::caller_env())
}

eval_context <- function (...) {
  calls <- sys.calls()
  frames <- sys.frames()
  n <- length(frames)
  list(.data = frames[[n - 7]]$.data, .env = frames[[n - 7]], 
       ...)
}

chunk <- function (.funs, .vars, ...)  {
  context <- eval_context()
  dplyr:::manip_at(context$.data, .vars, .funs, enquo(.funs), context$.env, ...)
}

# Example
mtcars %>% 
  group_by(gear) %>% 
  summarise2(
    vs = sum(vs), 
    chunk(mean,  vars(wt, qsec)),
    chunk(min,   vars(matches("drat|cyl"))),
    chunk(first, vars(last_col()))
  )
#> # A tibble: 3 x 7
#>    gear    vs    wt  qsec   cyl  drat  carb
#>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1     3     3  3.89  17.7     4  2.76     1
#> 2     4    10  2.62  19.0     4  3.69     4
#> 3     5     1  2.63  15.6     4  3.54     2

Created on 2019-03-24 by the reprex package (v0.2.1)

It would be great if summarise (like summarise2 in the reprex above) could accept not only named lists but also functions that return named lists of quosures, before the whole list is passed to summarise_impl.
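
The idea of a function that returns a named list of expressions can be illustrated with exported rlang tools only. This is a minimal sketch: chunk_exprs() is a hypothetical helper, not part of dplyr, splice, or dance, and unlike summarise2 its result still has to be spliced in with !!!.

```r
library(dplyr, warn.conflicts = FALSE)
library(rlang)

# Hypothetical helper: build a named list of calls such as sum(wt), sum(qsec)
chunk_exprs <- function(.fn, ...) {
  fn   <- enexpr(.fn)  # capture the function name unevaluated, e.g. `sum`
  vars <- ensyms(...)  # capture bare column names as symbols
  exprs <- lapply(vars, function(v) call2(fn, v))
  names(exprs) <- vapply(vars, as_string, character(1))  # keep original column names
  exprs
}

collapsed <- mtcars %>%
  group_by(cyl) %>%
  summarise(
    !!!chunk_exprs(sum, wt, qsec),
    !!!chunk_exprs(first, disp, drat)
  )

collapsed
```

Because the helper only builds calls, it needs no access to the data or to dplyr internals. The trade-off is that it takes bare column names only, with no tidyselect helpers and no extra function arguments.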

hadley · 2019-05-27T15:31:28Z

Duplicate of #2326

lock · 2019-11-23T16:17:32Z

This old issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with reprex) and link to this issue. https://reprex.tidyverse.org/

davidsjoberg changed the title ~~Hard to collapse data with summarise~~ [feature request] Simplify summarise to collapse data Mar 1, 2019

romainfrancois added the feature a feature request or enhancement label Mar 4, 2019

romainfrancois changed the title ~~[feature request] Simplify summarise to collapse data~~ Simplify summarise to collapse data Mar 4, 2019

hadley marked this as a duplicate of #2326 May 27, 2019

hadley closed this as completed May 27, 2019

lock bot locked and limited conversation to collaborators Nov 23, 2019
