Summarising verbs with variable-length outputs #2132

Closed
lionel- opened this issue Sep 21, 2016 · 10 comments
Labels
feature a feature request or enhancement verbs 🏃‍♀️

Comments

@lionel-
Member

lionel- commented Sep 21, 2016

A new dplyr family of verbs for variable-length output may be useful.

  • Like summarise() it would discard all input columns except for the grouping variables. This allows the output to have a different number of rows than the input.
  • Unlike summarise(), it would not require length 1 results and would only check for equal length within group. Grouping columns would be recycled to these lengths.

It could be called condense(), though it's only condensing in the sense that it gets rid of non-grouping variables. It may need a better name.
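The two rules above can be sketched for a single group in base R. This is only an illustration of the proposed semantics: `condense_group()`, its arguments, and its error message are hypothetical and not part of dplyr.

```r
# Hypothetical helper, not part of dplyr: applies the proposed rules to one group.
# `keys` is a one-row data frame holding the grouping values;
# `results` is a named list with one result vector per output column.
condense_group <- function(keys, results) {
  lens <- lengths(results)
  # Rule 2: results may have any length, but all must agree within the group.
  if (length(unique(lens)) > 1) {
    stop("results must have same length within groups")
  }
  n <- if (length(lens) > 0) lens[[1]] else 0L
  # Grouping columns are recycled to the common result length.
  recycled <- keys[rep(1L, n), , drop = FALSE]
  rownames(recycled) <- NULL
  # Rule 1: only the grouping columns and the new results are kept.
  cbind(recycled, as.data.frame(results, stringsAsFactors = FALSE))
}

out <- condense_group(data.frame(am = 0), list(col = 1:5, other = 5:1))
bad <- try(
  condense_group(data.frame(am = 0), list(col = 1:5, other = 2:1)),
  silent = TRUE
)
```

Here `out` has one row per result element with `am` repeated on each row, while `bad` captures the length-mismatch error.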

Ungrouped data frame: check the square constraint (all results must have the same length)

mtcars %>% condense(col = 1:5, other = 5:1)
#> # A tibble: 5 x 2
#>     col other
#>   <int> <int>
#> 1     1     5
#> 2     2     4
#> 3     3     3
#> 4     4     2
#> 5     5     1

mtcars %>% condense(col = 1:5, other = 2:1)
#> Error: results must have same length

This gives us immediately:

mtcars %>% condense_all(summary)
#> # A tibble: 6 x 11
#>           mpg         cyl        disp          hp        drat
#>   <S3: table> <S3: table> <S3: table> <S3: table> <S3: table>
#> 1        10.4        4.00        71.1        52.0        2.76
#> 2        15.4        4.00       121.0        96.5        3.08
#> 3        19.2        6.00       196.0       123.0        3.70
#> 4        20.1        6.19       231.0       147.0        3.60
#> 5        22.8        8.00       326.0       180.0        3.92
#> 6        33.9        8.00       472.0       335.0        4.93
#> # ... with 6 more variables: wt <S3: table>, qsec <S3: table>, vs <S3:
#> #   table>, am <S3: table>, gear <S3: table>, carb <S3: table>

For a grouped data frame, we'd check the square constraint within groups:

grouped <- mtcars %>% group_by(am)

grouped %>%
  condense(
    col = rep(mean(cyl), times = round(mean(cyl))),
    other = rep(length(col), length(col))
  )
#> # A tibble: 12 x 3
#>       am   col other
#>    <dbl> <dbl> <dbl>
#>  1     0  6.95     7
#>  2     0  6.95     7
#>  3     0  6.95     7
#>  4     0  6.95     7
#>  5     0  6.95     7
#>  6     0  6.95     7
#>  7     0  6.95     7
#>  8     1  5.08     5
#>  9     1  5.08     5
#> 10     1  5.08     5
#> 11     1  5.08     5
#> 12     1  5.08     5

grouped %>%
  condense(
    col = rep(mean(cyl), times = round(mean(cyl))),
    other = rep(length(col), length(col) - 1)
  )
#> Error: results must have same length within groups

Relevant discussion: #154

@lionel-
Member Author

lionel- commented Sep 21, 2016

We could also have a disperse() verb that would be like condense() but spread the results over numbered columns, giving 1 row per group like summarise(). That would be the equivalent of .collate = "cols" in the deprecated purrr df functions.

With that verb the lengths can be different across results but must be the same across groups.
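A rough base R sketch of what disperse() might do for one group follows. `disperse_group()` and the `col1`, `col2`, ... naming scheme are assumptions made for illustration; the thread does not specify how the numbered columns would be named.

```r
# Hypothetical helper, not part of dplyr: spreads each result over numbered
# columns so that the group collapses to a single row.
# `keys` is a one-row data frame holding the grouping values;
# `results` is a non-empty named list of result vectors.
disperse_group <- function(keys, results) {
  wide <- lapply(names(results), function(nm) {
    vals <- as.list(results[[nm]])
    # Assumed naming scheme: result name followed by the element position.
    names(vals) <- paste0(nm, seq_along(vals))
    vals
  })
  # Concatenate the per-result lists into one named list of scalars.
  cbind(keys, as.data.frame(do.call(c, wide), stringsAsFactors = FALSE))
}

wide_row <- disperse_group(data.frame(am = 0), list(col = 1:3, other = 2:1))
```

Because every group contributes one row, a given result must have the same length in every group for the rows to share the same columns, which is the constraint described above.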

@krlmlr
Member

krlmlr commented Sep 21, 2016

Why not simply summarize()-ing into a data frame?

iris %>% group_by(Species) %>% summarize(data = list(data_frame(Sepal.Length)))

@lionel-
Member Author

lionel- commented Sep 21, 2016

It's not the same output structure. I was thinking about these because we're deprecating the purrr df functions and they are more liberal with the kind of outputs they accept.

It's true the alternative is not too verbose, but not very expressive either:

mtcars %>%
  summarize_all(function(x) list(summary(x))) %>%
  tidyr::unnest()

mtcars %>%
  condense_all(summary)

@krlmlr
Member

krlmlr commented Nov 7, 2016

@hadley: Please advise.

@hadley
Member

hadley commented Nov 7, 2016

Ideally we wouldn't need a separate verb for this, and would instead make summarise() more flexible. But I'm not sure that's compatible with how summarise is currently implemented (i.e. how would we verify that each expression returns the same number of rows).

This is somewhat related to the idea of having summarise() and mutate() accept tibbles (that's about multiple columns, this is about multiple rows).

@krlmlr
Member

krlmlr commented Nov 7, 2016

For reference: #2149 (comment)

@hadley hadley added feature a feature request or enhancement verbs 🏃‍♀️ and removed data frame labels Feb 22, 2017
@romainfrancois
Member

Related to #2326

I typically use a summarise + unnest for these, e.g.

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

condense <- function(.data, ...){
  dots <- quos(...)
  
  summarise(.data, ..nested.. = list(tibble(!!!dots)) ) %>% 
    tidyr::unnest(..nested..)
}

mtcars %>% 
  group_by(cyl) %>% 
  condense(col = 1:5, other = 5:1)
#> # A tibble: 15 x 3
#>      cyl   col other
#>    <dbl> <int> <int>
#>  1    4.     1     5
#>  2    4.     2     4
#>  3    4.     3     3
#>  4    4.     4     2
#>  5    4.     5     1
#>  6    6.     1     5
#>  7    6.     2     4
#>  8    6.     3     3
#>  9    6.     4     2
#> 10    6.     5     1
#> 11    8.     1     5
#> 12    8.     2     4
#> 13    8.     3     3
#> 14    8.     4     2
#> 15    8.     5     1

grouped <- mtcars %>% group_by(am)

grouped %>%
  condense(
    col = rep(mean(cyl), times = round(mean(cyl))),
    other = rep(length(col), length(col))
  )
#> # A tibble: 12 x 3
#>       am   col other
#>    <dbl> <dbl> <int>
#>  1    0.  6.95     7
#>  2    0.  6.95     7
#>  3    0.  6.95     7
#>  4    0.  6.95     7
#>  5    0.  6.95     7
#>  6    0.  6.95     7
#>  7    0.  6.95     7
#>  8    1.  5.08     5
#>  9    1.  5.08     5
#> 10    1.  5.08     5
#> 11    1.  5.08     5
#> 12    1.  5.08     5

Created on 2018-04-23 by the reprex package (v0.2.0).

@romainfrancois
Member

group_map() perhaps ?

library(dplyr)

mtcars %>% 
  group_by(am) %>% 
  group_map(~{
    mean_cyl <- mean(.x$cyl)
    tibble(
      col = rep(mean_cyl, times = round(mean_cyl)),
      other = rep(length(col), length(col))
    )
  })
#> # A tibble: 12 x 3
#> # Groups:   am [2]
#>       am   col other
#>  * <dbl> <dbl> <int>
#>  1     0  6.95     7
#>  2     0  6.95     7
#>  3     0  6.95     7
#>  4     0  6.95     7
#>  5     0  6.95     7
#>  6     0  6.95     7
#>  7     0  6.95     7
#>  8     1  5.08     5
#>  9     1  5.08     5
#> 10     1  5.08     5
#> 11     1  5.08     5
#> 12     1  5.08     5

Created on 2018-12-14 by the reprex package (v0.2.1.9000)

@romainfrancois
Member

But it is syntactically far from mutate() and summarise().

Maybe there's room for a quosure-like function between summarise() and mutate():

  • mutate() : results with n() observations, no change to grouping structure
  • summarise() : results with 1 observation, peeling one layer of grouping structure
  • condense() / morph() : unspecified number of observations, but the same for each created column, ?? remake grouping structure

@romainfrancois
Member

This is obsolete now that summarise() can return size > 1
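For readers arriving later, here is a minimal sketch of the first example from this issue written against current dplyr. It assumes dplyr >= 1.1.0, where multi-row results are expressed with reframe(); in dplyr 1.0.x the same expressions could be passed to summarise() directly.

```r
library(dplyr)

# Assumes dplyr >= 1.1.0, which provides reframe().
# Each group returns 5 rows; the grouping column `am` is recycled to match.
out <- mtcars %>%
  group_by(am) %>%
  reframe(col = 1:5, other = 5:1)
```

`out` has 10 rows (5 per value of `am`) and the columns `am`, `col`, and `other`. Unlike summarise(), reframe() always returns an ungrouped data frame.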
