Optional parameter to control length of summarise #154

Closed
hadley opened this issue Dec 6, 2013 · 25 comments
Labels
feature a feature request or enhancement

@hadley
Member

hadley commented Dec 6, 2013

It would be useful to have a parameter that states the number of values the function should return. It's sometimes useful to have a summary function that returns multiple values, like if you're computing a fixed set of quantiles. The values should run down the column, and the grouping variables should be repeated.

For example:

summarise(group_by(mtcars, cyl), mpg = quantile(mpg), .n = 5)

We don't need to label the values, since the user could always do that themselves:

qs <- c(0, 0.5, 1)
summarise(group_by(mtcars, cyl), mpg = quantile(mpg, qs), q = qs, .n = 3)

When n > 1, summarise() shouldn't drop the last group from the grouping, since you might want to summarise the values you just computed.
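To make the requested shape concrete, here is a minimal base R sketch (not dplyr code, and not a proposed implementation) that produces the same layout by hand: values run down the column and the grouping value is repeated for each one. The names `groups`, `pieces`, and `long` are made up for the illustration.

```r
# Base R sketch of the requested output shape; assumes the built-in mtcars data.
qs <- c(0, 0.5, 1)
groups <- split(mtcars$mpg, mtcars$cyl)

# One small data frame per group: length(qs) rows, grouping value repeated.
pieces <- lapply(names(groups), function(g) {
  data.frame(
    cyl = as.numeric(g),
    q   = qs,
    mpg = unname(quantile(groups[[g]], qs))
  )
})

# Stack the per-group pieces: 3 groups x 3 quantiles = 9 rows.
long <- do.call(rbind, pieces)
long
```

Because `cyl` is still present in `long`, a further grouped summary over the computed values remains possible, which is the reason for not dropping the last grouping level.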

@ghost ghost assigned romainfrancois Dec 6, 2013
@romainfrancois
Member

Returning more than one thing in summarise is a problem in the current design, but I understand the usefulness.
I'm not sure I like the additional parameter though. What about having something at the left of the =. Something like this:

summarise(group_by(mtcars, cyl), c(min,max) = range(mpg) )

Unfortunately that does not parse.

> parse( text = "summarise(group_by(mtcars, cyl), c(min,max) = range(mpg) )" )
Error in parse(text = "summarise(group_by(mtcars, cyl), c(min,max) = range(mpg) )") :
  <text>:1:45: unexpected '='
1: summarise(group_by(mtcars, cyl), c(min,max) =
                                                ^

However, this does:

> parse( text = "summarise(group_by(mtcars, cyl), c(min,max) := range(mpg) )" )
expression(summarise(group_by(mtcars, cyl), c(min,max) := range(mpg) ))
> parse( text = "summarise(group_by(mtcars, cyl), c(min,max) %=% range(mpg) )" )
expression(summarise(group_by(mtcars, cyl), c(min,max) %=% range(mpg) ))

Not sure I like this either.

What worries me about .n is that it has then to be about all the expressions. What would we do if we wanted one expression to return one value and another one to return more:

summarise(group_by(mtcars, cyl), mpg = quantile(mpg), mean = mean(mpg), .n = 5)

@davidkane9
Contributor

Let me add another voice to those requesting this functionality. For me, and others in finance, there is often a need to calculate, for many companies and many dates, measures like a trailing standard deviation. This is easy for a single company, using functions like rollapply() from library(xts). To be able to call such functions within dplyr would be wonderful, and would probably create a much wider user base for dplyr within finance, or any community that uses a lot of time series data.

@hadley hadley modified the milestones: 0.3.1, 0.3 Sep 11, 2014
@romainfrancois
Member

Can we have something that marks the expected size of a result, e.g. :

mtcars %>% summarise( x = several(quantile(mpg), 5) )

or at least mark that something is expected to have more than one result:

mtcars %>% summarise( x = multiple(quantile(mpg)) )

Or something. I think retaining the default expectation of only one result otherwise makes sense and protects from mistakes. Having the user be explicit forces them to think about it.

Also kind of like something like this:

mtcars %>% summarise( c(min,max) := range(mpg) )

Not sure that would play along with lazyeval though and of course we would potentially introduce confusion about :=

@hadley
Member Author

hadley commented Oct 1, 2014

I was thinking of an additional argument to summarise - .n. It would default to .n = 1.

@romainfrancois
Member

Does .n apply to all the expressions? Would it be a vector of expected sizes?
Once all the information flows down to the C++ side, it should not be too hard to accommodate.

@hadley
Member Author

hadley commented Oct 1, 2014

@romainfrancois I think every expression would have to return an object of the same length.

@romainfrancois
Member

That's easy enough then, but won't people want to do e.g.

summarise( quantile(mpg), max(disp) )

or something ?

@hadley
Member Author

hadley commented Oct 1, 2014

Maybe. That would recycle max?

If you allow multiple lengths, then you'll have to check all smaller lengths are divisors of the largest, e.g. this would need to be an error:

summarise( quantile(mpg, c(0.25, 0.75)), quantile(mpg, c(0.25, 0.5, 0.75)))
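
The divisor rule can be sketched in a couple of lines of base R. `lengths_recyclable()` is a hypothetical helper written only for this illustration, not a dplyr function, and it assumes every length is at least 1.

```r
# Hypothetical helper (not part of dplyr): TRUE when every result length
# divides the largest one, so the shorter results could be recycled.
# Assumes all lengths are >= 1.
lengths_recyclable <- function(lens) {
  all(max(lens) %% lens == 0)
}

lengths_recyclable(c(5, 1))  # quantile(mpg) alongside max(disp): TRUE
lengths_recyclable(c(2, 3))  # the two quantile() calls above: FALSE
```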

@romainfrancois
Member

How about something that says automatic .n, i.e. whatever the length is the first time the expression is evaluated, that's what you want, and it's an error if a later result is not of that length. Something like :

summarise( quantile(mpg, c(0.25, 0.75)), quantile(mpg, c(0.25, 0.5, 0.75)), .n = first)

or perhaps a sibling to summarise that does just that.
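
A rough base R sketch of that "first result fixes the length" rule, to make the intended behaviour concrete. `summarise_first_n()` and its arguments are hypothetical names for this illustration only; the real thing would live on the C++ side and would look different.

```r
# Hypothetical helper illustrating the ".n = first" idea; not dplyr code.
# x: values, g: grouping vector, f: summary function applied per group.
summarise_first_n <- function(x, g, f) {
  pieces <- split(x, g)
  n <- NULL
  out <- vector("list", length(pieces))
  for (i in seq_along(pieces)) {
    res <- f(pieces[[i]])
    if (is.null(n)) {
      n <- length(res)  # the first group fixes the expected length
    } else if (length(res) != n) {
      stop("group ", names(pieces)[i], " returned ", length(res),
           " values, expected ", n)
    }
    # The group label is recycled against the n values.
    out[[i]] <- data.frame(group = names(pieces)[i], value = unname(res))
  }
  do.call(rbind, out)
}

# Every group returns two values, so this succeeds with 3 x 2 rows.
res <- summarise_first_n(mtcars$mpg, mtcars$cyl,
                         function(v) quantile(v, c(0.25, 0.75)))
res
```

A function whose result length depends on the group (for example `function(v) head(v, 8)` on this data, where one group has only 7 rows) would stop with an error at the first group that disagrees.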

@hadley
Member Author

hadley commented Oct 1, 2014

I like that! If we can make that work without too much effort, it would be great to make that the default.

@romainfrancois
Member

( I was looking for this 👍 but then I realized it always comes uninvited whenever I just want a ":". )

It should not be too much work if data type is homogeneous, e.g. we get a length 4 numeric vector. fine.

I'm worried about cases where we would e.g. get a list with different types, should we handle that... but I guess this is not a problem as we could say it makes a VECSXP column perhaps.

The other thing that troubles me a bit is naming the result columns. Say we have something like mpg = quantile(mpg): do we use the names from the result of quantile, do we make up names from mpg and some automatic numbering, or do we have something that expresses these names? Those names are needed for building the output data, but could also be referred to in the next expression, although I'm not sure this would really happen.

@hadley
Member Author

hadley commented Oct 1, 2014

I'd say we'd just ignore the names - you could always add as a separate col:

probs <- seq(0, 1, length = 5)
summarise(mtcars, probs, quantile(mpg, probs))

@hadley hadley modified the milestones: 0.4, 0.3.1 Oct 30, 2014
@romainfrancois
Member

Just trying to ease back in to this. When we get multiple results, e.g. quantile, are we expecting several columns, a matrix column, a list column, a data.frame column or whatever ?

Or do we want to be able to choose between some of these options?

e.g. a list column would be an easy way to get whatever, no need to impose constraints on the individual results, could be of different sizes, whatever.

Perhaps for something like quantile a matrix or data.frame column makes some sense ...

@hadley
Member Author

hadley commented Nov 14, 2014

It will be a vector (either atomic or list). (Potentially it could be a data frame or matrix, but I don't think we need to worry about that for now).

Maybe instead of specifying n we should specify a template for the output (like vapply()). Maybe something like this?

mtcars %>% 
   summarise(q = quantile(mpg, c(0.25, 0.5, 0.75)), .out = list(q = double(3)))

If .out was not provided, we would try to guess it by running the first subset. I'm not sure if this should be part of summarise, or be a different function. This starts to blur the line with do() (which might be a good thing, since do() is necessarily slow because it has to manipulate data frames).
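
For reference, this is how the vapply() template behaves in base R today. It is only a sketch of the analogy, not of how .out would be implemented; `by_cyl`, `q`, and `bad` are names made up for the example.

```r
# Base R only: vapply() checks every per-group result against FUN.VALUE.
by_cyl <- split(mtcars$mpg, mtcars$cyl)
probs <- c(0.25, 0.5, 0.75)

# Template says "three doubles per group"; extra arguments go to quantile().
q <- vapply(by_cyl, quantile, FUN.VALUE = double(3), probs = probs)
dim(q)  # 3 quantiles (rows) by 3 groups (columns)

# A template of the wrong length is an error rather than a silent reshape.
bad <- try(vapply(by_cyl, quantile, FUN.VALUE = double(2), probs = probs),
           silent = TRUE)
inherits(bad, "try-error")
```

The appeal of a template over a bare count is that it fixes the type as well as the length, so a group that returns, say, a character vector would also be caught.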

@romainfrancois
Member

Right, still don't get it. :/ Say we want

d <- mtcars %>% group_by(cyl) 
summarise( d, quantile(wt, c(.25, .5, .75) ) )

Potential cases I'm thinking of:

> do( d, as.data.frame(quantile(.$wt, c(.25, .75) ) ) )
Source: local data frame [6 x 2]
Groups: cyl

  cyl quantile(.$wt, c(0.25, 0.75))
1   4                       1.88500
2   4                       2.62250
3   6                       2.82250
4   6                       3.44000
5   8                       3.53250
6   8                       4.01375

I'm not particularly happy with this one because summarise is supposed to yield one row per group.

Or:

> do( d, as.data.frame( t(quantile(.$wt, c(.25, .75) ) ) ) )
Source: local data frame [3 x 3]
Groups: cyl

  cyl    25%     75%
1   4 1.8850 2.62250
2   6 2.8225 3.44000
3   8 3.5325 4.01375

Not completely happy with this one either because from one expression we would get several columns.

Or something like this:

> out <- data.frame( cyl = c(4,6,8), q = 0 )
> out$q <- as.data.frame(do( d, as.data.frame( t(quantile(.$wt, c(.25, .75) ) ) ) ))[,-1]
> out %>% str
'data.frame':   3 obs. of  2 variables:
 $ cyl: num  4 6 8
 $ q  :'data.frame':    3 obs. of  2 variables:
  ..$ 25%: num  1.89 2.82 3.53
  ..$ 75%: num  2.62 3.44 4.01

Or something else. Once I know what to aim for, the code should write itself fairly easily.

@hadley
Member Author

hadley commented Nov 21, 2014

Right, this would require that we relax the constraint on summarise() from "one row per group" to "n rows per group". That may be a big enough change that it should be a different verb.

Yet another interface would be to return a list-column like:

do(d, q = quantile(.$wt, c(.25, .75)))

Maybe we could make that work as is if you added an explicit list():

summarise(d, list(quantile(wt, c(.25, .75))))

It seems like we need something half-way between summarise() and do() - more flexible than summarise and more efficient than do.

@romainfrancois
Member

Back again here. Given #832 we can now use this syntax:

> summarise(d, list(quantile(wt, c(.25, .75))))
Source: local data frame [3 x 2]

  cyl list(quantile(wt, c(0.25, 0.75)))
1   4                          <dbl[2]>
2   6                          <dbl[2]>
3   8                          <dbl[2]>
>
> summarise(d, list(quantile(wt, c(.25, .75)))) %>% str
Classes ‘tbl_df’, ‘tbl’ and 'data.frame':   3 obs. of  2 variables:
 $ cyl                              : num  4 6 8
 $ list(quantile(wt, c(0.25, 0.75))):List of 3
  ..$ : Named num  1.89 2.62
  .. ..- attr(*, "names")= chr  "25%" "75%"
  ..$ : Named num  2.82 3.44
  .. ..- attr(*, "names")= chr  "25%" "75%"
  ..$ : Named num  3.53 4.01
  .. ..- attr(*, "names")= chr  "25%" "75%"
 - attr(*, "drop")= logi TRUE

Could be some other function's job to flatten the list columns, e.g :

d %>% summarise( quantiles = list(quantile(wt, c(.25, .75))) ) %>% flatten( quantiles )

or something. This way summarise doesn't have to change and we can later decide which variables to organize, how, etc ... if we come up with the right syntax for it.

@romainfrancois
Member

flatten

@romainfrancois
Member

Perhaps that's a job for tidyr ? @hadley ?

@romainfrancois
Member

or perhaps

# controlling the names 
d %>% 
  summarise( quantiles = list(quantile(wt, c(.25, .75))) ) %>% 
  flatten( quantiles, names = c("q25", "q75") )


# automatic names by default
# i.e. by checking that all items in the list column have the same names
d %>% 
  summarise( quantiles = list(quantile(wt, c(.25, .75))) ) %>% 
  flatten( quantiles )

@romainfrancois
Member

Or flatten could be a pronoun, e.g.

d %>% 
  summarise( flatten( quantile(wt, c(.25, .75)), names = c("q25", "q75" ) ) )

@romainfrancois
Member

In terms of syntax, I'd prefer something like flatten to be a verb.

Performance wise, making it a pronoun of summarise might be better as we would not have to allocate the list ...

@brianstamper

I came across this problem today, and found to get the effect of "flatten" described above I could use the composition of t() and as.data.frame():

library("dplyr")
df <- data.frame(foo = rep(c("a", "b", "c"), 4), bar = 1:12)
> df
   foo bar
1    a   1
2    b   2
3    c   3
4    a   4
5    b   5
6    c   6
7    a   7
8    b   8
9    c   9
10   a  10
11   b  11
12   c  12
df.q <- df %>%
  group_by(foo) %>%
  summarise(quantiles = list(quantile(bar)))
q <- t(as.data.frame(df.q$quantiles))
row.names(q) <- NULL
df.q <- cbind(df.q, q)
df.q$quantiles <- NULL
> df.q
  foo 0%  25% 50%  75% 100%
1   a  1 3.25 5.5 7.75   10
2   b  2 4.25 6.5 8.75   11
3   c  3 5.25 7.5 9.75   12

@petersmp

I am not sure that a separate row per output is the best approach here. Following from @brianstamper, I occasionally find myself doing things like this:

df %>%
  group_by(foo) %>%
  summarise_(.dots = setNames(paste0("summary(bar)[",1:6,"]"), names(summary(1:5))))

To get outputs like this:

# A tibble: 3 × 7
     foo  `Min. bar` `1st Qu. bar` `Median bar`  `Mean bar` `3rd Qu. bar`  `Max. bar`
  <fctr> <S3: table>   <S3: table>  <S3: table> <S3: table>   <S3: table> <S3: table>
1      a           1          3.25          5.5         5.5          7.75          10
2      b           2          4.25          6.5         6.5          8.75          11
3      c           3          5.25          7.5         7.5          9.75          12

I have found workarounds like this often (this is actually perhaps the cleanest I have found), but they never seem quite satisfying. It would be nice if I could use something like summarise directly, particularly when I want to also calculate other summaries (e.g., the mean of some other variable). Perhaps summarise_ is the right approach, but there may be value in making this more straightforward.

@hadley
Member Author

hadley commented Feb 2, 2017

Now part of #2326

@hadley hadley closed this as completed Feb 2, 2017
@lock lock bot locked as resolved and limited conversation to collaborators Jun 8, 2018