Optional parameter to control length of summarise #154

hadley · 2013-12-06T15:23:02Z

It would be useful to have a parameter that states the number of values the function should return. It's sometimes useful to have a summary function that returns multiple values, like if you're computing a fixed set of quantiles. The values should run down the column, and the grouping variables should be repeated.

For example:

summarise(group_by(mtcars, cyl), mpg = quantile(mpg), .n = 5)

We don't need to label the values, since the user could always do that themselves:

qs <- c(0, 0.5, 1)
summarise(group_by(mtcars, cyl), mpg = quantile(mpg, qs), q = qs, .n = 3)

When n > 1, summarise() shouldn't drop the last group from the grouping, since you might want to summarise the values you just computed.
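To make the requested shape concrete, here is a minimal base R sketch of the output the proposal describes: the quantiles run down one column and the grouping variable is repeated for each value. It only illustrates the shape and is not how summarise() would implement .n; the names `pieces` and `out` are made up for this example.

```r
# Base R sketch of the proposed output shape (not dplyr's implementation).
qs <- c(0, 0.5, 1)

# One small data frame per cyl group: cyl is repeated, values run down the column
pieces <- lapply(split(mtcars, mtcars$cyl), function(g) {
  data.frame(cyl = g$cyl[1], q = qs, mpg = unname(quantile(g$mpg, qs)))
})

out <- do.call(rbind, pieces)
rownames(out) <- NULL
out  # 3 groups x 3 quantiles = 9 rows
```

Because cyl is still present on every row, the result could be grouped and summarised again, which is the reason given above for not dropping the last grouping level.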

romainfrancois · 2013-12-08T20:09:54Z

Returning more than one thing in summarise is a problem in the current design, but I understand the usefulness.
I'm not sure I like the additional parameter though. What about having something on the left of the =? Something like this:

summarise(group_by(mtcars, cyl), c(min,max) = range(mpg) )

Unfortunately that does not parse.

> parse( text = "summarise(group_by(mtcars, cyl), c(min,max) = range(mpg) )" )
Error in parse(text = "summarise(group_by(mtcars, cyl), c(min,max) = range(mpg) )") :
  <text>:1:45: unexpected '='
1: summarise(group_by(mtcars, cyl), c(min,max) =
                                                ^

However, this does:

> parse( text = "summarise(group_by(mtcars, cyl), c(min,max) := range(mpg) )" )
expression(summarise(group_by(mtcars, cyl), c(min,max) := range(mpg) ))
> parse( text = "summarise(group_by(mtcars, cyl), c(min,max) %=% range(mpg) )" )
expression(summarise(group_by(mtcars, cyl), c(min,max) %=% range(mpg) ))

Not sure I like this either.
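The parser behaviour above can be checked programmatically. This is a small sketch; `parses()` is a helper made up for the example, and `d` is never evaluated because the code is only parsed.

```r
# Does a piece of R source text parse? Returns TRUE or FALSE.
parses <- function(code) {
  !inherits(tryCatch(parse(text = code), error = function(e) e), "error")
}

parses("summarise(d, c(min, max) = range(mpg))")    # '=' after a call: syntax error
parses("summarise(d, c(min, max) := range(mpg))")   # accepted by the parser
parses("summarise(d, c(min, max) %=% range(mpg))")  # accepted by the parser

# The := form reaches the function as an ordinary call to `:=`,
# so summarise() would have to interpret that call itself.
expr <- parse(text = "summarise(d, c(min, max) := range(mpg))")[[1]]
as.character(expr[[3]][[1]])
```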

What worries me about .n is that it then has to apply to all the expressions. What would we do if we wanted one expression to return one value and another to return more?

summarise(group_by(mtcars, cyl), mpg = quantile(mpg), mean = mean(mpg), .n = 5)

davidkane9 · 2014-07-05T13:45:06Z

Let me add another voice to those requesting this functionality. For me, and others in finance, there is often a need to calculate --- for many companies and many dates --- measures like a trailing standard deviation. This is easy, for a single company, by using functions like rollapply() from library(xts). To be able to call such functions within dplyr would be wonderful, and would probably create a much wider user base within finance, or any community that uses a lot of time series data, for dplyr.
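As an illustration of that use case, here is a base R sketch of a trailing standard deviation computed per company. The data, the column names, and the `trailing_sd()` helper are all invented for the example; it uses a plain loop rather than rollapply() so that it runs without extra packages.

```r
# Hypothetical helper: sd over the last `width` values, NA until enough data.
trailing_sd <- function(x, width) {
  n <- length(x)
  out <- rep(NA_real_, n)
  if (n >= width) {
    for (i in width:n) out[i] <- sd(x[(i - width + 1):i])
  }
  out
}

# Made-up returns for two companies, five dates each
prices <- data.frame(
  company = rep(c("A", "B"), each = 5),
  ret     = c(1, 2, 3, 4, 5, 2, 2, 2, 2, 2)
)

# ave() applies the function within each company and keeps one value per row
prices$sd3 <- ave(prices$ret, prices$company,
                  FUN = function(x) trailing_sd(x, 3))
prices
```

The function returns as many values as the group has rows, which is exactly the kind of result a one-value-per-group summarise() cannot hold.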

romainfrancois · 2014-10-01T12:55:53Z

Can we have something that marks the expected size of a result, e.g. :

mtcars %>% summarise( x = several(quantile(mpg), 5) )

or at least mark that something is expected to have more than one result:

mtcars %>% summarise( x = multiple(quantile(mpg)) )

Or something. I think retaining the default expectation of only one result otherwise makes sense and protects from mistakes. Having the user be explicit forces them to think about it.
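A marker like several() could be as small as a length check. The sketch below is hypothetical: `several()` does not exist in dplyr, and here it is called directly on a vector rather than inside summarise().

```r
# Hypothetical marker: pass x through unchanged, but only if it has length n.
several <- function(x, n) {
  if (length(x) != n) {
    stop("expected ", n, " values, got ", length(x), call. = FALSE)
  }
  x
}

several(quantile(mtcars$mpg), 5)    # quantile() defaults to five probabilities
try(several(range(mtcars$mpg), 5))  # error: range() returns two values
```

The explicit size keeps the protection against accidental multi-value results while letting the user opt in.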

Also kind of like something like this:

mtcars %>% summarise( c(min,max) := range(mpg) )

Not sure that would play along with lazyeval though, and of course we would potentially introduce confusion about :=.

hadley · 2014-10-01T13:22:13Z

I was thinking of an additional argument to summarise - .n. It would default to .n = 1.

romainfrancois · 2014-10-01T13:25:44Z

Does .n apply to all the expressions? Would it be a vector of expected sizes?
Once all the information flows down to the C++ side, it should not be too hard to accommodate.

hadley · 2014-10-01T13:26:30Z

@romainfrancois I think every expression would have to return an object of the same length.

romainfrancois · 2014-10-01T13:28:27Z

That's easy enough then, but won't people want to do e.g.

summarise( quantile(mpg), max(disp) )

or something ?

hadley · 2014-10-01T13:30:06Z

Maybe. That would recycle max?

If you allow multiple lengths, then you'll have to check all smaller lengths are divisors of the largest, e.g. this would need to be an error:

summarise( quantile(mpg, c(0.25, 0.75)), quantile(mpg, c(0.25, 0.5, 0.75)))
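The divisor rule can be sketched in a few lines of base R. `recycle_results()` is a hypothetical helper written for this example, not part of dplyr; it takes the results for a single group and either recycles them to the longest length or stops.

```r
# Hypothetical helper: recycle results to a common length,
# erroring unless every length divides the longest one.
recycle_results <- function(...) {
  results <- list(...)
  lens <- lengths(results)
  n <- max(lens)
  if (any(lens == 0L) || any(n %% lens != 0L)) {
    stop("result lengths (", paste(lens, collapse = ", "),
         ") must all divide the longest", call. = FALSE)
  }
  lapply(results, rep_len, length.out = n)
}

# quantile() gives 5 values and max() gives 1, so max is recycled 5 times
str(recycle_results(quantile(mtcars$mpg), max(mtcars$disp)))

# Lengths 2 and 3: 2 does not divide 3, so this is an error
try(recycle_results(quantile(mtcars$mpg, c(0.25, 0.75)),
                    quantile(mtcars$mpg, c(0.25, 0.5, 0.75))))
```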

romainfrancois · 2014-10-01T13:34:22Z

How about something that says automatic .n, i.e. whatever the length is the first time the expression is evaluated, that's what you want, and it's an error if a later result is not of that length. Something like :

summarise( quantile(mpg, c(0.25, 0.75)), quantile(mpg, c(0.25, 0.5, 0.75)), .n = first)

or perhaps a sibling to summarise that does just that.
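One way to read "automatic .n" is sketched below in base R, assuming the groups are already split into a list. `summarise_first_n()` is a made-up name; the real implementation would live on the C++ side and could look quite different.

```r
# Hypothetical sketch of ".n = first": the first group's result fixes the length.
summarise_first_n <- function(groups, f) {
  n <- NULL
  lapply(groups, function(g) {
    res <- f(g)
    if (is.null(n)) {
      n <<- length(res)  # length seen on the first group
    } else if (length(res) != n) {
      stop("expected length ", n, ", got ", length(res), call. = FALSE)
    }
    res
  })
}

by_cyl <- split(mtcars$mpg, mtcars$cyl)

# Always two values per group, so this succeeds
res <- summarise_first_n(by_cyl, function(x) quantile(x, c(0.25, 0.75)))

# Returning the raw values gives a different length per group, so this errors
try(summarise_first_n(by_cyl, function(x) x))
```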

hadley · 2014-10-01T13:37:33Z

I like that! If we can make that work without too much effort, it would be great to make that the default.

romainfrancois · 2014-10-01T13:49:51Z

( I was looking for this 👍 but then I realized it always comes uninvited whenever I just want a ":". )

It should not be too much work if data type is homogeneous, e.g. we get a length 4 numeric vector. fine.

I'm worried about cases where we would e.g. get a list with different types, should we handle that... but I guess this is not a problem as we could say it makes a VECSXP column perhaps.

The other thing that troubles me a bit is naming the result columns. Say we have something like mpg = quantile(mpg): do we use the names from the result of quantile, do we make up names from mpg and some automatic numbering, or do we have something that expresses these names? Those names are needed for building the output data, but could also be referred to in the next expression, although I'm not sure this would really happen.

hadley · 2014-10-01T14:02:37Z

I'd say we'd just ignore the names - you could always add as a separate col:

probs <- seq(0, 1, length = 5)
summarise(mtcars, probs, quantile(mpg, probs))

romainfrancois · 2014-11-13T13:47:48Z

Just trying to ease back in to this. When we get multiple results, e.g. quantile, are we expecting several columns, a matrix column, a list column, a data.frame column or whatever ?

Or do we want to be able to choose between some of these options?

e.g. a list column would be an easy way to get whatever, no need to impose constraints on the individual results, could be of different sizes, whatever.

Perhaps for something like quantile a matrix or data.frame column makes some sense ...

hadley · 2014-11-14T13:37:55Z

It will be a vector (either atomic or list). (Potentially it could be a data frame or matrix, but I don't think we need to worry about that for now).

Maybe instead of specifying n we should specify a template for the output (like vapply()). Maybe something like this?

mtcars %>% 
   summarise(q = quantile(mpg, c(0.25, 0.5, 0.75)), .out = list(q = double(3)))

If .out was not provided, we would try to guess it by running the first subset. I'm not sure if this should be part of summarise, or be a different function. This starts to blur the line with do() (which might be a good thing since do() is necessarily slow because it has to manipulate data frames).
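For reference, this is how the vapply() template works in base R, applied to the same quantile example. The .out argument itself is only a proposal; the sketch shows the existing FUN.VALUE mechanism it is modelled on.

```r
probs <- c(0.25, 0.5, 0.75)
by_cyl <- split(mtcars$mpg, mtcars$cyl)

# FUN.VALUE = double(3) is the template: each group must return 3 doubles
q <- vapply(by_cyl, quantile, FUN.VALUE = double(3), probs = probs)
q  # a 3 x 3 matrix: one column per cyl group, one row per quantile

# A template that does not match the result is an error rather than a surprise
try(vapply(by_cyl, quantile, FUN.VALUE = double(2), probs = probs))
```

Knowing the type and length up front is what would let the output be allocated once instead of being discovered group by group.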

romainfrancois · 2014-11-16T10:52:46Z

Right, still don't get it. :/ Say we want

d <- mtcars %>% group_by(cyl) 
summarise( d, quantile(wt, c(.25, .5, .75) ) )

Potential cases I'm thinking of:

> do( d, as.data.frame(quantile(.$wt, c(.25, .75) ) ) )
Source: local data frame [6 x 2]
Groups: cyl

  cyl quantile(.$wt, c(0.25, 0.75))
1   4                       1.88500
2   4                       2.62250
3   6                       2.82250
4   6                       3.44000
5   8                       3.53250
6   8                       4.01375

I'm not particularly happy with this one because summarise is supposed to yield one row per group.

Or:

> do( d, as.data.frame( t(quantile(.$wt, c(.25, .75) ) ) ) )
Source: local data frame [3 x 3]
Groups: cyl

  cyl    25%     75%
1   4 1.8850 2.62250
2   6 2.8225 3.44000
3   8 3.5325 4.01375

Not completely happy with this one either because from one expression we would get several columns.

Or something like this:

> out <- data.frame( cyl = c(4,6,8), q = 0 )
> out$q <- as.data.frame(do( d, as.data.frame( t(quantile(.$wt, c(.25, .75) ) ) ) ))[,-1]
> out %>% str
'data.frame':   3 obs. of  2 variables:
 $ cyl: num  4 6 8
 $ q  :'data.frame':    3 obs. of  2 variables:
  ..$ 25%: num  1.89 2.82 3.53
  ..$ 75%: num  2.62 3.44 4.01

Or something else. Once I know what to aim for, the code should write itself fairly easily.

hadley · 2014-11-21T21:07:12Z

Right, this would require that we relax the constraint on summarise() from "one row per group" to "n rows per group". That may be a big enough change that it should be a different verb.

Yet another interface would be to return a list-column like:

do(d, q = quantile(.$wt, c(.25, .75)))

Maybe we could make that work as is if you added an explicit list():

summarise(d, list(quantile(wt, c(.25, .75))))

It seems like we need something half-way between summarise() and do() - more flexible than summarise and more efficient than do.

romainfrancois · 2014-12-16T09:53:44Z

Back again here. Given #832 we can now use this syntax:

> summarise(d, list(quantile(wt, c(.25, .75))))
Source: local data frame [3 x 2]

  cyl list(quantile(wt, c(0.25, 0.75)))
1   4                          <dbl[2]>
2   6                          <dbl[2]>
3   8                          <dbl[2]>
>
> summarise(d, list(quantile(wt, c(.25, .75)))) %>% str
Classes ‘tbl_df’, ‘tbl’ and 'data.frame':   3 obs. of  2 variables:
 $ cyl                              : num  4 6 8
 $ list(quantile(wt, c(0.25, 0.75))):List of 3
  ..$ : Named num  1.89 2.62
  .. ..- attr(*, "names")= chr  "25%" "75%"
  ..$ : Named num  2.82 3.44
  .. ..- attr(*, "names")= chr  "25%" "75%"
  ..$ : Named num  3.53 4.01
  .. ..- attr(*, "names")= chr  "25%" "75%"
 - attr(*, "drop")= logi TRUE

Could be some other function's job to flatten the list columns, e.g :

d %>% summarise( quantiles = list(quantile(wt, c(.25, .75))) ) %>% flatten( quantiles )

or something. This way summarise doesn't have to change and we can later decide which variables to organize, how, etc ... if we come up with the right syntax for it.

romainfrancois · 2014-12-16T09:56:34Z

romainfrancois · 2014-12-16T09:57:13Z

Perhaps that's a job for tidyr ? @hadley ?

romainfrancois · 2014-12-16T10:08:32Z

or perhaps

# controlling the names 
d %>% 
  summarise( quantiles = list(quantile(wt, c(.25, .75))) ) %>% 
  flatten( quantiles, names = c("q25", "q75") )


# automatic names by default
# i.e. by checking that all items in the list column have the same names
d %>% 
  summarise( quantiles = list(quantile(wt, c(.25, .75))) ) %>% 
  flatten( quantiles )

romainfrancois · 2014-12-16T10:11:02Z

Or flatten could be a pronoun, e.g.

d %>% 
  summarise( flatten( quantile(wt, c(.25, .75)), names = c("q25", "q75" ) ) )

romainfrancois · 2014-12-16T10:17:28Z

In terms of syntax, I'd prefer something like flatten to be a verb.

Performance wise, making it a pronoun of summarise might be better as we would not have to allocate the list ...
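The verb form of flatten discussed above could look roughly like this in base R. `flatten_col()` and its `col_names` argument are hypothetical names for this sketch, and the list column is built by hand so the example does not depend on any dplyr version.

```r
# Hypothetical helper: spread a list column of equal-length vectors into columns.
flatten_col <- function(df, col, col_names = NULL) {
  items <- df[[col]]
  stopifnot(is.list(items), length(unique(lengths(items))) == 1L)
  if (is.null(col_names)) {
    # automatic names: every item in the list column must carry the same names
    col_names <- names(items[[1]])
    same <- vapply(items, function(x) identical(names(x), col_names), logical(1))
    stopifnot(!is.null(col_names), all(same))
  }
  wide <- do.call(rbind, lapply(items, unname))  # one row per group
  colnames(wide) <- col_names
  cbind(df[setdiff(names(df), col)], wide)
}

# Build a list column by hand: one pair of quantiles per cyl group
d <- data.frame(cyl = c(4, 6, 8))
d$quantiles <- unname(lapply(split(mtcars$wt, mtcars$cyl),
                             quantile, probs = c(.25, .75)))

flatten_col(d, "quantiles")                               # columns 25% and 75%
flatten_col(d, "quantiles", col_names = c("q25", "q75"))  # user-supplied names
```

As noted above, this version has to allocate the list column first, which is the cost a pronoun inside summarise could avoid.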

brianstamper · 2016-02-24T14:15:13Z

I came across this problem today, and found to get the effect of "flatten" described above I could use the composition of t() and as.data.frame():

library("dplyr")
df <- data.frame(foo = rep(c("a", "b", "c"), 4), bar = 1:12)

df.q <- df %>%
  group_by(foo) %>%
  summarise(quantiles = list(quantile(bar)))
q <- t(as.data.frame(df.q$quantiles))
row.names(q) <- NULL
df.q <- cbind(df.q, q)
df.q$quantiles <- NULL

> df.q
  foo 0%  25% 50%  75% 100%
1   a  1 3.25 5.5 7.75   10
2   b  2 4.25 6.5 8.75   11
3   c  3 5.25 7.5 9.75   12

petersmp · 2016-10-19T19:55:37Z

I am not sure that a separate row per output is the best approach here. Following from @brianstamper, I occasionally find myself doing things like this:

df %>%
  group_by(foo) %>%
  summarise_(.dots = setNames(paste0("summary(bar)[",1:6,"]"), names(summary(1:5))))

To get outputs like this:

# A tibble: 3 × 7
     foo  `Min. bar` `1st Qu. bar` `Median bar`  `Mean bar` `3rd Qu. bar`  `Max. bar`
  <fctr> <S3: table>   <S3: table>  <S3: table> <S3: table>   <S3: table> <S3: table>
1      a           1          3.25          5.5         5.5          7.75          10
2      b           2          4.25          6.5         6.5          8.75          11
3      c           3          5.25          7.5         7.5          9.75          12

I have found workarounds like this often (this is actually perhaps the cleanest I have found), but they never seem quite satisfying. It would be nice if I could use something like summarise directly, particularly when I want to also calculate other summaries (e.g., the mean of some other variable). Perhaps summarise_ is the right approach, but there may be value in making this more straightforward.

hadley · 2017-02-02T20:56:35Z

Now part of #2326

ghost assigned romainfrancois Dec 6, 2013

piccolbo mentioned this issue Apr 4, 2014

summarise or mutate with functions returning multiple values/columns #372

Closed

hadley modified the milestones: 0.3.1, 0.3 Sep 11, 2014

hadley modified the milestones: 0.4, 0.3.1 Oct 30, 2014

lionel- mentioned this issue Jan 17, 2015

summarise_each_q naming consistency #442

Closed

acthomasca mentioned this issue Jun 13, 2015

segfaulting problem on Ubuntu Linux, again #952

Closed

hadley mentioned this issue Aug 24, 2015

Interleave multiple function summarise_each() output [Feature Request] #1335

Closed

mine-cetinkaya-rundel mentioned this issue Sep 12, 2015

removed some dollar signs in intro to r lab OpenIntroStat/oilabs-tidy#9

Closed

hadley modified the milestones: 0.6, 0.5 Oct 22, 2015

hadley mentioned this issue Nov 6, 2015

A better summary function #1514

Closed

lionel- mentioned this issue Aug 18, 2016

Deprecate dataframe-based mapping functions tidyverse/purrr#226

Merged

lionel- mentioned this issue Sep 21, 2016

Summarising verbs with variable-length outputs #2132

Closed

etiennebr mentioned this issue Dec 2, 2016

Multivariate mutate #2286

Closed

hadley closed this as completed Feb 2, 2017

romainfrancois mentioned this issue Dec 20, 2017

Automatically unpack unnamed df-cols #2326

Closed

lock bot locked as resolved and limited conversation to collaborators Jun 8, 2018
