Adding aggregation parameter to spread() to manage multiple values for the same key #474
Supporting an aggregation function in widening pivots is now trivial to implement, and I'll take a stab at it next week. If not supplied and the keys are non-unique, I think it makes the most sense to return a list column, as it's not clear that duplicating the other measures makes sense in general.
Alternatively, should this be a separate verb that simplifies list columns by applying an aggregation function? It might be useful in other cases, like where the list-cols come from JSON.
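Such a verb might look like the following sketch. Everything here is hypothetical: the name `simplify_listcols` and its interface are invented for illustration, not part of tidyr.

```r
# Hypothetical sketch: a verb that collapses every list column of a data
# frame by applying an aggregation function to each cell. The name and
# interface are invented here, not tidyr API.
simplify_listcols <- function(df, .f) {
  df[] <- lapply(df, function(col) {
    if (is.list(col)) vapply(col, .f, numeric(1)) else col
  })
  df
}

df <- data.frame(id = c(1, 2))
df$x <- list(c(1, 2, 3), c(4, 5))  # list column, e.g. parsed from JSON
simplify_listcols(df, mean)$x      # each cell collapsed with mean()
# → c(2, 4.5)
```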
Probably, we can consider
Implemented, but I don't love the argument name. @yutannihilation do you have any better ideas?
How about something like
Last night I realised that, for the sake of consistency, if you supply an aggregation function, it should always be applied regardless of whether there are duplicates or not. Otherwise, I think it's confusing for code like
Agreed, it makes sense. In terms of consistency, I feel a bit uneasy that the column type can differ (the same type or a list-col) depending on whether there are duplicate keys or not. So I still think there should be an option to make it error when there's duplication, like
Hmmmm, that's an interesting point. I think my counter-argument would be that the type of the
Yes, the spec itself doesn't know the column types, and users should be sure about the data themselves. But, still, from the viewpoint of the user, I expect integer columns if I spread an integer column, for example. I'm not sure, but my feeling that the length should not affect the type of the result is basically the same one as we feel about the existence of

So, I believe we need
What if instead of
Sounds good to me 👍
I implemented my idea using

That said, it just occurred to me that you could always use something like this:

```r
just_one <- function(x) {
  if (vec_size(x) == 1)
    return(x)
  stop("Unexpectedly non-unique keys")
}
```
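For illustration, such a guard behaves as follows. This sketch swaps `vctrs::vec_size()` for base `length()` (equivalent for atomic vectors) so it runs without any package dependency:

```r
# Same idea as just_one() above, using base length() instead of
# vctrs::vec_size() so this sketch has no package dependency.
just_one <- function(x) {
  if (length(x) == 1)
    return(x)
  stop("Unexpectedly non-unique keys")
}

just_one(42)            # → 42, a single value passes through unchanged
try(just_one(c(1, 2)))  # errors: "Unexpectedly non-unique keys"
```

Passed as the aggregation function, this turns silently duplicated keys into a loud failure.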
The `tidyr` function `spread()` returns an error if the same key (here `a`) has multiple values:

In our project we needed to spread to multiple rows in order to get this:
Spreading these duplicates to multiple rows can be seen as a special case of the more general problem of handling duplicate identifiers. For example, in Python (in the `pandas` package) and in Microsoft Excel (under the name pivot table), duplicate identifiers are handled using an aggregation function, which is currently not supported in `tidyr`.

With this in mind, we suggest an alternative implementation of `spread` which supports all current parameters, but adds the `aggfunc` parameter to define how any multiple rows for the same key should be aggregated (e.g. average, sum, …). The default value `NULL` does not aggregate, but keeps the multiple rows. An implementation of this is provided at `spread_with_multiple_values()`. Examples:

1. Dataframe without multiple values for the same key
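The original example data did not survive extraction, so here is a hedged sketch of what such a `df_1` might look like (column names invented), widened with base `reshape()` rather than the author's implementation:

```r
# Hypothetical df_1: exactly one value per (id, key) pair, so spreading
# is unambiguous and no aggregation is needed.
df_1 <- data.frame(
  id    = c(1, 1, 2, 2),
  key   = c("a", "b", "a", "b"),
  value = c(10, 20, 30, 40)
)

# Base-R equivalent of tidyr::spread(df_1, key, value):
wide <- reshape(df_1, idvar = "id", timevar = "key", direction = "wide")
wide  # columns id, value.a, value.b; one row per id
```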
Which produces the same result as `tidyr::spread(df_1, key, value)`:

2. Dataframe with multiple values for the same key + no `aggfunc` provided

Which results in multiple rows for multiple values of the same key, while `tidyr::spread(df_2, key, value)` would throw an error.

Alternative ways to get the same result:
3. Dataframe with multiple values for the same key + `aggfunc` provided

Which results in:

`spread_with_multiple_values()` shows that it is possible to combine an aggregation function with the standard input parameters of `tidyr::spread()`. Would it be an option to extend the functionality of `spread` in this direction and support an `aggfunc` parameter? Any input is welcome, and I'm certainly willing to open a PR with this adaptation if it would provide added value.