Simplify summarise to collapse data #4232

Closed
davidsjoberg opened this issue Mar 1, 2019 · 4 comments
Labels
feature a feature request or enhancement

Comments

@davidsjoberg

Hi,

dplyr is awesome, thank you for that. An issue I have come across multiple times when reviewing and writing code is collapsing data with dplyr. Users often want to apply several summary functions, such as min, max, mean, and first, to different columns. summarise_at works great for homogeneous tibbles where every column should be summarised with the same function. But since it replaces the original tibble, it can't be piped into another summarise_at that uses a different aggregation function.

A common way to write this:

# Load the latest version of dplyr
if (packageVersion("dplyr") < "0.8") {
  stop("Update dplyr")
} else {
  library(dplyr, quietly = TRUE, warn.conflicts = FALSE)
}
#> Warning: package 'dplyr' was built under R version 3.4.4

# How users often collapse with dplyr
mtcars %>% 
  group_by(cyl) %>% 
  summarise(
    wt   = sum(wt),
    qsec = sum(qsec),
    disp = first(disp),
    drat = first(drat),
    mpg  = min(mpg, na.rm = TRUE),
    hp   = min(hp, na.rm = TRUE),
    vs   = min(vs, na.rm = TRUE),
    am   = mean(am),
    gear = mean(gear),
    carb = mean(carb)
  )
#> # A tibble: 3 x 11
#>     cyl    wt  qsec  disp  drat   mpg    hp    vs    am  gear  carb
#>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1     4  25.1  211.   108  3.85  21.4    52     0 0.727  4.09  1.55
#> 2     6  21.8  126.   160  3.9   17.8   105     0 0.429  3.86  3.43
#> 3     8  56.0  235.   360  3.15  10.4   150     0 0.143  3.29  3.5

Created on 2019-03-01 by the reprex package (v0.2.1)

The main problem with the above approach is that it is tedious, especially if a tibble has more than 50 columns. It is also error-prone, since the user has to reference each column twice and repeat the function for every column. Basically, it is hard to write efficient and dynamic code when collapsing data.

A solution would be to let users specify chunks of columns that should be summarised with the same function (and keep original name). I tried to make a simple take on how this would work.

### Creating the functions chunk and summarise_by

# Load the latest version of dplyr
if (packageVersion("dplyr") < "0.8") {
  stop("Update dplyr")
} else {
  library(dplyr, quietly = TRUE, warn.conflicts = FALSE)
}
#> Warning: package 'dplyr' was built under R version 3.4.4


# chunk
chunk <- function (.funs, .vars, ...) 
{
  syms <- syms(substring(as.character(.vars), 2))
  funs <- dplyr:::as_fun_list(.funs, enquo(.funs), rlang::caller_env(), ...)
  dplyr:::manip_apply_syms(funs, syms, NULL)
}

# summarise_by (chunks)
summarise_by <- function(.tbl, ...){
  dplyr:::summarise_impl(.tbl, c(...), environment(), rlang::caller_env())
}

### Apply solution
mtcars %>% 
  group_by(cyl) %>% 
  summarise_by(
    chunk(sum,               vars(wt, qsec)),
    chunk(first,             vars(disp, drat)),
    chunk(min, na.rm = TRUE, vars(mpg, hp, vs)),
    chunk(mean,              vars(am, gear, carb))
  )
#> # A tibble: 3 x 11
#>     cyl    wt  qsec  disp  drat   mpg    hp    vs    am  gear  carb
#>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1     4  25.1  211.   108  3.85  21.4    52     0 0.727  4.09  1.55
#> 2     6  21.8  126.   160  3.9   17.8   105     0 0.429  3.86  3.43
#> 3     8  56.0  235.   360  3.15  10.4   150     0 0.143  3.29  3.5

Created on 2019-03-01 by the reprex package (v0.2.1)
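
The reprex above relies on unexported dplyr functions (called with `:::`), which can change between releases. As a rough illustration of the same chunking idea that avoids them, here is a base R sketch. The helper name `summarise_chunks` and its `by`/`chunks` arguments are hypothetical, made up for this example, and are not part of dplyr.

```r
# Hypothetical helper (not part of dplyr): for each group, apply one
# function per chunk of columns and keep the original column names.
summarise_chunks <- function(data, by, chunks) {
  groups <- split(data, data[[by]])
  rows <- lapply(groups, function(g) {
    # Start from a one-row data frame holding the grouping value
    out <- g[1, by, drop = FALSE]
    for (chunk in chunks) {
      for (col in chunk$vars) {
        out[[col]] <- chunk$fun(g[[col]])
      }
    }
    out
  })
  res <- do.call(rbind, rows)
  rownames(res) <- NULL
  res
}

res <- summarise_chunks(
  mtcars,
  by = "cyl",
  chunks = list(
    list(fun = sum,                              vars = c("wt", "qsec")),
    list(fun = function(x) x[1],                 vars = c("disp", "drat")),
    list(fun = function(x) min(x, na.rm = TRUE), vars = c("mpg", "hp", "vs")),
    list(fun = mean,                             vars = c("am", "gear", "carb"))
  )
)
print(res)
```

Each column is named only once, next to the function that summarises it, which is the property the proposal is after. Column names are plain strings here, so there is no tidyselect support.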

Two things I have not solved: incorporating tidyselect, and making this work without creating a new summarise_* function, which is probably possible. The preferred syntax would be:

# mtcars %>% 
#   group_by(cyl) %>% 
#   summarise(
#     wt = sum(wt),
#     chunk(first,             vars(disp, drat)),
#     chunk(min, na.rm = TRUE, vars(matches("hp|vs")))
#   )

Is this an issue that is worth digging into?

@davidsjoberg davidsjoberg changed the title Hard to collapse data with summarise [feature request] Simplify summarise to collapse data Mar 1, 2019
@romainfrancois romainfrancois added the feature a feature request or enhancement label Mar 4, 2019
@romainfrancois romainfrancois changed the title [feature request] Simplify summarise to collapse data Simplify summarise to collapse data Mar 4, 2019
@romainfrancois
Member

We are aware that this is a problem and are running various experiments on it.

  • In 0.9.* you'll be able to return data frames (with 1 row) from summarise expressions.
  • Your implementation is somewhat similar to what I've played with in the splice 📦:
library(dplyr, warn.conflicts = FALSE)
library(splice)

mtcars %>% 
  group_by(cyl) %>% 
  summarise(
    !!!at_(vars(wt, qsec), sum), 
    !!!at_(vars(disp, drat), first),
    !!!at_(vars(mpg, hp, vs), min, na.rm = TRUE),
    !!!at_(vars(am, gear, carb), mean)
  )
#> # A tibble: 3 x 11
#>     cyl    wt  qsec  disp  drat   mpg    hp    vs    am  gear  carb
#>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1     4  25.1  211.   108  3.85  21.4    52     0 0.727  4.09  1.55
#> 2     6  21.8  126.   160  3.9   17.8   105     0 0.429  3.86  3.43
#> 3     8  56.0  235.   360  3.15  10.4   150     0 0.143  3.29  3.5

Here the ... are used for extra function parameters, so the tidy selection has to be captured by vars().

Created on 2019-03-04 by the reprex package (v0.2.1.9000)

  • but I currently prefer the approach I'm taking with the dance 📦 💃 🕺 with the tango() + swing() combo:
library(purrr)
library(dplyr, warn.conflicts = FALSE)
library(dance)
mtcars %>% 
  group_by(cyl) %>% 
  tango(
    swing(sum, wt, qsec), 
    swing(first, disp, drat), 
    swing(~min(., na.rm = TRUE), mpg, hp, vs), 
    swing(mean, am, gear, carb)
  )
#> # A tibble: 3 x 11
#>     cyl    wt  qsec  disp  drat   mpg    hp    vs    am  gear  carb
#>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1     4  25.1  211.   108  3.85  21.4    52     0 0.727  4.09  1.55
#> 2     6  21.8  126.   160  3.9   17.8   105     0 0.429  3.86  3.43
#> 3     8  56.0  235.   360  3.15  10.4   150     0 0.143  3.29  3.5

The take is slightly different here: swing() takes only one function or lambda, and the ... are used for the tidy selection, which here is just an enumeration of variables.
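
The first bullet above, returning a data frame from a summarise expression, can be sketched with `across()`. This is an assumption-laden illustration: it requires dplyr 1.0.0 or later, which is newer than any version discussed in this thread. There, `across()` returns a data frame for each group, and `summarise()` unpacks its columns when the expression is unnamed.

```r
library(dplyr, warn.conflicts = FALSE)

# Assumes dplyr >= 1.0.0 (not the version used elsewhere in this thread).
# Each across() call covers one chunk of columns with one function.
res <- mtcars %>%
  group_by(cyl) %>%
  summarise(
    across(c(wt, qsec), sum),
    across(c(disp, drat), first),
    across(c(mpg, hp, vs), ~ min(.x, na.rm = TRUE)),
    across(c(am, gear, carb), mean)
  )
print(res)
```

The column selection inside `across()` uses tidyselect, so helpers such as `matches()` can be used in place of the enumerated names.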

@davidsjoberg
Author

Thanks for your answer. Both approaches are really good! The nice thing about dance is that no !!! is needed. But splice is used in conjunction with summarise, which I find intuitive.

It took me some time to understand how splice finds its way through the environments to get tidyselect to work. But I finally managed to work around the !!! syntax of splice while keeping a syntax that is more similar to dance. Normal summarise syntax also works if users prefer to pass named lists as usual.

# Functions
library(tidyverse)

summarise2 <- function (.data, ...) {
  dots <- quos(...)
  dots <- unlist(map_if(dots, names(dots) == "", ~rlang::eval_tidy(.)))
  dplyr:::summarise_impl(.data, dots, environment(), rlang::caller_env())
}

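# Look up the calling context by walking the call stack.
# Note: the hard-coded offset (n - 7) assumes a specific stack depth between
# chunk() and the frame that holds .data, so it is fragile.
eval_context <- function (...) {
  frames <- sys.frames()
  n <- length(frames)
  list(.data = frames[[n - 7]]$.data, .env = frames[[n - 7]], 
       ...)
}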

chunk <- function (.funs, .vars, ...)  {
  context <- eval_context()
  dplyr:::manip_at(context$.data, .vars, .funs, enquo(.funs), context$.env, ...)
}

# Example
mtcars %>% 
  group_by(gear) %>% 
  summarise2(
    vs = sum(vs), 
    chunk(mean,  vars(wt, qsec)),
    chunk(min,   vars(matches("drat|cyl"))),
    chunk(first, vars(last_col()))
  )
#> # A tibble: 3 x 7
#>    gear    vs    wt  qsec   cyl  drat  carb
#>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1     3     3  3.89  17.7     4  2.76     1
#> 2     4    10  2.62  19.0     4  3.69     4
#> 3     5     1  2.63  15.6     4  3.54     2

Created on 2019-03-24 by the reprex package (v0.2.1)

It would be great if summarise (like summarise2 in the reprex above) could accept both named lists and functions that return named lists of quosures, before the whole list is passed to summarise_impl.
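
For comparison, the "function that returns a named list" idea can be sketched using only exported functions, at the cost of bringing back `!!!`. The helper `chunk_exprs` below is hypothetical and written only for this illustration. It takes the function name and the column names as strings and builds one call per column with `rlang::call2()`.

```r
library(dplyr, warn.conflicts = FALSE)

# Hypothetical helper (not part of dplyr or rlang): build a named list of
# calls, one per column, e.g. list(wt = mean(wt), qsec = mean(qsec)).
chunk_exprs <- function(fun_name, cols, ...) {
  calls <- lapply(cols, function(col) {
    rlang::call2(fun_name, rlang::sym(col), ...)
  })
  rlang::set_names(calls, cols)
}

res <- mtcars %>%
  group_by(gear) %>%
  summarise(
    vs = sum(vs),
    !!!chunk_exprs("mean", c("wt", "qsec")),
    !!!chunk_exprs("min", c("drat", "cyl"), na.rm = TRUE)
  )
print(res)
```

Because the list is spliced into an ordinary `summarise()` call, named expressions such as `vs = sum(vs)` can be mixed in freely, and no stack inspection is needed. This sketch does not support tidyselect helpers, since the columns are given as strings.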

@hadley
Member

hadley commented May 27, 2019

Duplicate of #2326

@hadley hadley marked this as a duplicate of #2326 May 27, 2019
@hadley hadley closed this as completed May 27, 2019
@lock

lock bot commented Nov 23, 2019

This old issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with reprex) and link to this issue. https://reprex.tidyverse.org/

@lock lock bot locked and limited conversation to collaborators Nov 23, 2019