Optional parameter to control length of summarise #154
Comments
Returning more than one thing in
Unfortunately that does not parse.
However, this does:
Not sure I like this either. What worries me about
|
Let me add another voice to those requesting this functionality. For me, and others in finance, there is often a need to calculate, for many companies and many dates, measures like a trailing standard deviation. This is easy for a single company, using functions like rollapply() from library(xts). To be able to call such functions within dplyr would be wonderful, and would probably give dplyr a much wider user base within finance, or within any community that uses a lot of time series data. |
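To make the use case above concrete, here is a minimal base R sketch of a trailing standard deviation computed per company. It is only an illustration: the data, the column names and the helper `trailing_sd()` are all made up for this example, it uses a plain loop rather than rollapply(), and it assumes rows are already sorted by date within each company.

```r
# Hypothetical toy data: two companies, six dates each.
returns <- data.frame(
  company = rep(c("A", "B"), each = 6),
  date    = rep(as.Date("2014-01-01") + 0:5, times = 2),
  ret     = c(1, 2, 3, 4, 5, 6, 2, 2, 2, 2, 2, 2)
)

# Hypothetical helper: standard deviation over the last `width`
# observations. Positions without a full window stay NA.
trailing_sd <- function(x, width) {
  out <- rep(NA_real_, length(x))
  if (length(x) >= width) {
    for (i in width:length(x)) {
      out[i] <- sd(x[(i - width + 1):i])
    }
  }
  out
}

# ave() applies the function within each company and returns a vector
# aligned with the original rows, so the result has one value per row.
returns$sd3 <- ave(returns$ret, returns$company,
                   FUN = function(x) trailing_sd(x, width = 3))
```

Note that the result has as many values per group as there are input rows, which is exactly the kind of output a one-value-per-group summarise() cannot hold.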
Can we have something that marks the expected size of a result, e.g. :
or at least mark that something is expected to have more than one result:
Or something. I think retaining the default expectation of only one result otherwise makes sense and protects from mistakes. Having the user be explicit forces them to think about it. Also kind of like something like this:
Not sure that would play along with |
I was thinking of an additional argument to summarise - |
Does |
@romainfrancois I think every expression would have to return an object of the same length. |
That's easy enough then, but won't people want to do e.g.
or something ? |
Maybe. That would recycle max? If you allow multiple lengths, then you'll have to check that all smaller lengths are divisors of the largest, e.g. this would need to be an error:
```r
summarise(
  quantile(mpg, c(0.25, 0.75)),
  quantile(mpg, c(0.25, 0.5, 0.75))
)
```
|
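The divisor rule described above could be checked along these lines. This is a sketch only: `lengths_recyclable()` is a hypothetical helper, not a dplyr function, and it does not try to handle zero-length results.

```r
# Hypothetical helper: TRUE when every result length divides the
# largest one, so that shorter results could be recycled to fit.
lengths_recyclable <- function(...) {
  n <- lengths(list(...))
  all(max(n) %% n == 0)
}

# Lengths 1 and 2: max() could be recycled against two quantiles.
ok <- lengths_recyclable(
  max(mtcars$mpg),
  quantile(mtcars$mpg, c(0.25, 0.75))
)

# Lengths 2 and 3: 2 does not divide 3, so this should be an error.
bad <- lengths_recyclable(
  quantile(mtcars$mpg, c(0.25, 0.75)),
  quantile(mtcars$mpg, c(0.25, 0.5, 0.75))
)
```

Here `ok` is TRUE and `bad` is FALSE, matching the two cases discussed in the thread.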
How about something that says automatic
or perhaps a sibling to |
I like that! If we can make that work without too much effort, it would be great to make that the default. |
(I was looking for this 👍 but then I realized it always comes uninvited whenever I just want a ":".) It should not be too much work if the data type is homogeneous, e.g. we get a length 4 numeric vector, fine. I'm worried about cases where we would get e.g. a list with different types: should we handle that? But I guess this is not a problem as we could say it makes a The other thing that troubles me a bit is naming the result columns. Say we have something like |
I'd say we'd just ignore the names. You could always add them as a separate column:
```r
probs <- seq(0, 1, length = 5)
summarise(mtcars, probs, quantile(mpg, probs))
```
|
Just trying to ease back in to this. When we get multiple results, e.g. from quantile, are we expecting several columns, a matrix column, a list column, a data.frame column or something else? Or do we want to be able to choose between some of these options? A list column, for example, would be an easy way to hold anything: no need to impose constraints on the individual results, which could be of different sizes. Perhaps for something like |
It will be a vector (either atomic or list). (Potentially it could be a data frame or matrix, but I don't think we need to worry about that for now). Maybe instead of specifying
```r
mtcars %>%
  summarise(q = quantile(mpg, c(0.25, 0.5, 0.75)), .out = list(q = double(3)))
```
If |
Right, still don't get it. :/ Say we want
Potential cases I'm thinking of:
I'm not particularly happy with this one because Or:
Not completely happy with this one either because from one expression we would get several columns. Or something like this:
Or something else. Once I know what to aim for, the code should write itself fairly easily. |
Right, this would require we relax the constraint on Yet another interface would be to return a list-column like:
```r
do(d, q = quantile(.$wt, c(.25, .75)))
```
Maybe we could make that work as is if you added an explicit
```r
summarise(d, list(quantile(wt, c(.25, .75))))
```
It seems like we need something half-way between |
Back again here. Given #832 we can now use this syntax:
Could be some other function's job to flatten the list columns, e.g :
or something. This way |
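For readers following along, the idea of flattening a list column can be sketched in base R as below. This is an assumption-laden illustration, not the syntax proposed in this comment: `flatten()` here is a hypothetical helper written for this example, and the data frame `d` is built by hand rather than with summarise().

```r
# One row per group, with a list column holding two quantiles each.
by_cyl <- split(mtcars$mpg, mtcars$cyl)
d <- data.frame(cyl = as.numeric(names(by_cyl)))
d$q <- unname(lapply(by_cyl, quantile, probs = c(0.25, 0.75)))

# Hypothetical flatten(): repeat each row once per element of the
# list column, then replace the list column with the unlisted values.
flatten <- function(data, col) {
  n <- lengths(data[[col]])
  out <- data[rep(seq_len(nrow(data)), times = n), , drop = FALSE]
  out[[col]] <- unlist(data[[col]], use.names = FALSE)
  rownames(out) <- NULL
  out
}

flat <- flatten(d, "q")
```

The grouping value is repeated down the rows and the list column becomes an ordinary numeric column, so results of different lengths per group would also work.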
Perhaps that's a job for tidyr ? @hadley ? |
or perhaps
|
Or
|
In terms of syntax, I'd prefer something like Performance wise, making it a pronoun of |
I came across this problem today, and found that to get the effect of "flatten" described above I could use the composition of t() and as.data.frame():
```r
library("dplyr")

df <- data.frame(foo = rep(c("a", "b", "c"), 4), bar = 1:12)

# One row per group, with the quantiles stored in a list column
df.q <- df %>%
  group_by(foo) %>%
  summarise(quantiles = list(quantile(bar)))

# Turn the list column into a matrix with one row per group
q <- t(as.data.frame(df.q$quantiles))
row.names(q) <- NULL

# Bind the quantile columns on and drop the list column
df.q <- cbind(df.q, q)
df.q$quantiles <- NULL
```
|
I am not sure that a separate row per output is the best approach here. Following on from @brianstamper, I occasionally find myself doing things like this:
To get outputs like this:
I have found workarounds like this often (this is actually perhaps the cleanest I have found), but they never seem quite satisfying. It would be nice if I could use something like |
Now part of #2326 |
It would be useful to have a parameter that states the number of values the function should return. It's sometimes useful to have a summary function that returns multiple values, like if you're computing a fixed set of quantiles. The values should run down the column, and the grouping variables should be repeated.
For example:
We don't need to label the values, since the user could always do that themselves:
When n > 1, summarise() shouldn't drop the last group from the grouping, since you might want to summarise the values you just computed.
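As a rough picture of the output shape requested in this issue, the base R sketch below computes three quantiles per group, with the values running down one column and the grouping variable repeated. It is not dplyr code and says nothing about how summarise() would implement this; the column names `prob` and `q` are made up for the example.

```r
probs <- c(0.25, 0.5, 0.75)

# For each cyl group, build a small data frame with one row per
# requested quantile. The label column `prob` is added by hand.
pieces <- lapply(split(mtcars, mtcars$cyl), function(g) {
  data.frame(
    cyl  = g$cyl[1],
    prob = probs,
    q    = unname(quantile(g$mpg, probs))
  )
})

# Stack the per-group pieces: 3 groups x 3 quantiles = 9 rows.
result <- do.call(rbind, pieces)
rownames(result) <- NULL
```

Because `cyl` is still present on every row, the result can be grouped and summarised again, which is the point made above about not dropping the last grouping level.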