Summarising verbs with variable-length outputs #2132

lionel- · 2016-09-21T17:42:51Z

A new dplyr family of verbs for variable-length output may be useful.

Like summarise() it would discard all input columns except for the grouping variables. This allows the output to have a different number of rows than the input.
Unlike summarise(), it would not require length 1 results and would only check for equal length within group. Grouping columns would be recycled to these lengths.

It could be called condense(), though it's only condensing in the sense that it gets rid of non-grouping variables. It may need a better name.

Ungrouped data frame: check the square constraint

mtcars %>% condense(col = 1:5, other = 5:1)
#> # A tibble: 5 x 2
#>     col other
#>   <int> <int>
#> 1     1     5
#> 2     2     4
#> 3     3     3
#> 4     4     2
#> 5     5     1

mtcars %>% condense(col = 1:5, other = 2:1)
#> Error: results must have same length

This gives us immediately:

mtcars %>% condense_all(summary)
#> # A tibble: 6 x 11
#>           mpg         cyl        disp          hp        drat
#>   <S3: table> <S3: table> <S3: table> <S3: table> <S3: table>
#> 1        10.4        4.00        71.1        52.0        2.76
#> 2        15.4        4.00       121.0        96.5        3.08
#> 3        19.2        6.00       196.0       123.0        3.70
#> 4        20.1        6.19       231.0       147.0        3.60
#> 5        22.8        8.00       326.0       180.0        3.92
#> 6        33.9        8.00       472.0       335.0        4.93
#> # ... with 6 more variables: wt <S3: table>, qsec <S3: table>, vs <S3:
#> #   table>, am <S3: table>, gear <S3: table>, carb <S3: table>
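
The ungrouped check described above can be sketched in base R. This is only an illustration under assumptions: condense_ungrouped() is a hypothetical helper, not part of dplyr, and it ignores the input data frame entirely.

```r
# Hypothetical helper (not a dplyr function): combine named results into a
# data frame only if every result has the same length.
condense_ungrouped <- function(...) {
  results <- list(...)
  lens <- lengths(results)
  # The "square constraint": no recycling, all lengths must match exactly
  if (length(unique(lens)) > 1) {
    stop("results must have same length", call. = FALSE)
  }
  as.data.frame(results)
}

# Equal lengths: returns a 5-row data frame
ok <- condense_ungrouped(col = 1:5, other = 5:1)

# Unequal lengths: the error message is captured instead of thrown
bad <- tryCatch(
  condense_ungrouped(col = 1:5, other = 2:1),
  error = function(e) conditionMessage(e)
)
```

The explicit length check is deliberately stricter than the recycling rules of data.frame(), which would accept a length-1 result next to a length-5 one.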

For a grouped data frame, we'd check the square constraint within groups:

grouped <- mtcars %>% group_by(am)

grouped %>%
  condense(
    col = rep(mean(cyl), times = round(mean(cyl))),
    other = rep(length(col), length(col))
  )
#> # A tibble: 12 x 3
#>       am   col other
#>    <dbl> <dbl> <dbl>
#>  1     0  6.95     7
#>  2     0  6.95     7
#>  3     0  6.95     7
#>  4     0  6.95     7
#>  5     0  6.95     7
#>  6     0  6.95     7
#>  7     0  6.95     7
#>  8     1  5.08     5
#>  9     1  5.08     5
#> 10     1  5.08     5
#> 11     1  5.08     5
#> 12     1  5.08     5

grouped %>%
  condense(
    col = rep(mean(cyl), times = round(mean(cyl))),
    other = rep(length(col), length(col) - 1)
  )
#> Error: results must have same length within groups

Relevant discussion: #154


lionel- · 2016-09-21T17:43:19Z

We could also have a disperse() verb that would be like condense() but spread the result over numbered columns, resulting in 1 row per group like summarise(). That would be the equivalent of .collate = "cols" in the deprecated purrr df functions.

With that verb the lengths can be different across results but must be the same across groups.
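
A rough base R sketch of the disperse() idea, for a single result per group. The name disperse_sketch(), its arguments, and the value1, value2, ... column names are all assumptions made for illustration; nothing here is an existing dplyr or purrr API.

```r
# Hypothetical sketch of disperse(): one row per group, with each group's
# result spread over numbered columns.
disperse_sketch <- function(data, group, f) {
  pieces <- split(data, data[[group]])
  results <- lapply(pieces, f)
  # The result must have the same length in every group
  if (length(unique(lengths(results))) > 1) {
    stop("results must have same length across groups", call. = FALSE)
  }
  wide <- do.call(rbind, results)
  colnames(wide) <- paste0("value", seq_len(ncol(wide)))
  # Note: the group key comes back as character because it is taken from
  # the names produced by split()
  out <- data.frame(names(pieces), wide, row.names = NULL)
  names(out)[1] <- group
  out
}

# range() returns two values per group, so this yields value1 and value2
res <- disperse_sketch(mtcars, "am", function(d) range(d$mpg))
```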

krlmlr · 2016-09-21T17:58:20Z

Why not simply summarize()-ing into a data frame?

iris %>% group_by(Species) %>% summarize(data = list(data_frame(Sepal.Length)))

lionel- · 2016-09-21T18:12:14Z

It's not the same output structure. I was thinking about these because we're deprecating the purrr df functions and they are more liberal with the kind of outputs they accept.

It's true the alternative is not too verbose, but not very expressive either:

mtcars %>%
  summarize_all(function(x) list(summary(x))) %>%
  tidyr::unnest()

mtcars %>%
  condense_all(summary)

krlmlr · 2016-11-07T20:05:36Z

@hadley: Please advise.

hadley · 2016-11-07T20:16:24Z

Ideally we wouldn't need a separate verb for this, and would instead make summarise() more flexible. But I'm not sure that's compatible with how summarise is currently implemented (i.e. how would we verify that each expression returns the same number of rows).

This is somewhat related to the idea of having summarise() and mutate() accept tibbles (that's about multiple columns, this is about multiple rows).

krlmlr · 2016-11-07T20:44:39Z

For reference: #2149 (comment)

romainfrancois · 2018-04-23T09:50:21Z

Related to #2326

I typically use a summarise + unnest for these, e.g.

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

condense <- function(.data, ...){
  dots <- quos(...)
  
  summarise(.data, ..nested.. = list(tibble(!!!dots)) ) %>% 
    tidyr::unnest(..nested..)
}

mtcars %>% 
  group_by(cyl) %>% 
  condense(col = 1:5, other = 5:1)
#> # A tibble: 15 x 3
#>      cyl   col other
#>    <dbl> <int> <int>
#>  1    4.     1     5
#>  2    4.     2     4
#>  3    4.     3     3
#>  4    4.     4     2
#>  5    4.     5     1
#>  6    6.     1     5
#>  7    6.     2     4
#>  8    6.     3     3
#>  9    6.     4     2
#> 10    6.     5     1
#> 11    8.     1     5
#> 12    8.     2     4
#> 13    8.     3     3
#> 14    8.     4     2
#> 15    8.     5     1

grouped <- mtcars %>% group_by(am)

grouped %>%
  condense(
    col = rep(mean(cyl), times = round(mean(cyl))),
    other = rep(length(col), length(col))
  )
#> # A tibble: 12 x 3
#>       am   col other
#>    <dbl> <dbl> <int>
#>  1    0.  6.95     7
#>  2    0.  6.95     7
#>  3    0.  6.95     7
#>  4    0.  6.95     7
#>  5    0.  6.95     7
#>  6    0.  6.95     7
#>  7    0.  6.95     7
#>  8    1.  5.08     5
#>  9    1.  5.08     5
#> 10    1.  5.08     5
#> 11    1.  5.08     5
#> 12    1.  5.08     5

Created on 2018-04-23 by the reprex package (v0.2.0).

romainfrancois · 2018-12-14T15:40:55Z

group_map() perhaps ?

library(dplyr)

mtcars %>% 
  group_by(am) %>% 
  group_map(~{
    mean_cyl <- mean(.x$cyl)
    tibble(
      col = rep(mean_cyl, times = round(mean_cyl)),
      other = rep(length(col), length(col))
    )
  })
#> # A tibble: 12 x 3
#> # Groups:   am [2]
#>       am   col other
#>  * <dbl> <dbl> <int>
#>  1     0  6.95     7
#>  2     0  6.95     7
#>  3     0  6.95     7
#>  4     0  6.95     7
#>  5     0  6.95     7
#>  6     0  6.95     7
#>  7     0  6.95     7
#>  8     1  5.08     5
#>  9     1  5.08     5
#> 10     1  5.08     5
#> 11     1  5.08     5
#> 12     1  5.08     5

Created on 2018-12-14 by the reprex package (v0.2.1.9000).

romainfrancois · 2018-12-14T15:52:01Z

But it is syntactically far from mutate() and summarise().

Maybe there's room for a quosure-like function between summarise() and mutate():

mutate() : results with n() observations, no change to grouping structure
summarise() : result with 1 observation, peeling one layer of grouping structure
condense() / morph() : unspecified number of observations, but the same for each created column, ?? remake grouping structure

romainfrancois · 2019-11-25T14:35:48Z

This is obsolete now that summarise() can return results of size > 1.
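
For readers arriving later, here is a sketch of the grouped example from the top of the thread using built-in verbs. It assumes dplyr 1.1.0 or later, where reframe() was added for results with any number of rows per group; summarise() accepted such results from dplyr 1.0.0, and that use was later deprecated in favour of reframe().

```r
library(dplyr)

# Assumes dplyr >= 1.1.0 for reframe()
res <- mtcars %>%
  group_by(am) %>%
  reframe(
    col = rep(mean(cyl), times = round(mean(cyl))),
    other = rep(length(col), length(col))
  )
```

Unlike the group_map() version above, reframe() returns an ungrouped tibble.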

krlmlr added the data frame label Nov 7, 2016

hadley added the feature (a feature request or enhancement) and verbs 🏃‍♀️ labels and removed the data frame label Feb 22, 2017

romainfrancois mentioned this issue Apr 30, 2018

Automatically unpack unnamed df-cols #2326

Closed

romainfrancois closed this as completed Nov 25, 2019
