
Adding aggregation parameter to spread() to manage multiple values for the same key #474

Closed
damianooldoni opened this issue Jul 6, 2018 · 13 comments

Comments

@damianooldoni
commented Jul 6, 2018

The tidyr function spread() returns an error if the same key (here a) has multiple values:

> df <- data.frame(key = c("a","a"), value = c(2,3))
> df
  key value
1   a     2
2   a     3
> spread(df, key, value)
Error: Duplicate identifiers for rows (1, 2)

In our project we needed to spread on multiple rows in order to get this:

  a
1 2
2 3

Spreading these duplicates to multiple rows can be seen as a special case of the more general problem of handling duplicate identifiers. For example, in Python (the pandas package) and in Microsoft Excel (as pivot tables), duplicate identifiers are handled with an aggregation function, which tidyr currently does not support.
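For reference, here is a minimal sketch of the pandas behaviour mentioned above: `pivot_table()` resolves duplicate identifiers by applying an aggregation function (the mean, made explicit here) rather than raising an error, using the same two-row data frame as the tidyr example.

```python
import pandas as pd

# Same data as the tidyr example: two values (2, 3) for the same key "a".
df = pd.DataFrame({"key": ["a", "a"], "value": [2, 3]})

# pivot_table aggregates duplicates instead of erroring; with aggfunc="mean"
# the duplicate values 2 and 3 collapse to 2.5 in the widened column "a".
wide = df.pivot_table(columns="key", values="value", aggfunc="mean")
print(wide)
```

This is the kind of behaviour an `aggfunc` parameter would bring to `spread()`.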

With this in mind, we suggest an alternative implementation of spread which supports all current parameters but adds an aggfunc parameter to define how multiple rows for the same key should be aggregated (e.g. average, sum, …). The default value NULL does not aggregate but keeps the multiple rows. An implementation of this is provided at spread_with_multiple_values(). Examples:

1. Dataframe without multiple values for same key

library(trias)
df_1 <- data.frame(
  col1 = c(1, 1, 1, 1),
  col2 = c("H", "H", "H", "H"),
  key = c("A", "B", "C", "D"),
  value = c("R", "S", "T", "X"),
  stringsAsFactors = FALSE
)
spread_with_multiple_values(df_1, key, value)

Which produces the same result as tidyr::spread(df_1, key, value):

  col1 col2 A B C D
1    1    H R S T X

2. Dataframe with multiple values for same key + no aggfunc provided

library(trias)
df_2 <- data.frame(
  col1 = c(1, 1, 1, 1),
  col2 = c("H", "H", "H", "H"),
  key = c("A", "B", "C", "C"),
  value = c("R", "S", "T", "X"), # multiple values T, X for key C
  stringsAsFactors = FALSE
)
spread_with_multiple_values(df_2, key, value) # no “aggfunc” parameter

Which results in multiple rows for multiple values for the same key:

  col1 col2 A B C
1    1    H R S T
2    1    H R S X

While tidyr::spread(df_2, key, value) would throw an error.

Alternative ways to get the same result:

spread_with_multiple_values(df_2, 3, 4)
spread_with_multiple_values(df_2, -2, -1)
spread_with_multiple_values(df_2, "key", "value")

3. Dataframe with multiple values for same key + aggfunc provided

library(trias)
df_3 <- data.frame(
  col1 = c(1, 1, 1, 1),
  col2 = c("H", "H", "H", "H"),
  key = c("A", "B", "C", "C"),
  value = c(2, 3, 1, 8), # multiple values 1, 8 for key C
  stringsAsFactors = FALSE
)
spread_with_multiple_values(df_3, key, value, aggfunc = str_c, collapse = "-")
spread_with_multiple_values(df_3, key, value, aggfunc = min)
spread_with_multiple_values(df_3, key, value, aggfunc = mean)

Which results in:

> spread_with_multiple_values(df_3, key, value, aggfunc = str_c, collapse = "-")
# A tibble: 1 x 5
   col1 col2  A     B     C    
  <dbl> <chr> <chr> <chr> <chr>
1     1 H     2     3     1-8  
> spread_with_multiple_values(df_3, key, value, aggfunc = min)
# A tibble: 1 x 5
   col1 col2      A     B     C
  <dbl> <chr> <dbl> <dbl> <dbl>
1     1 H         2     3     1
> spread_with_multiple_values(df_3, key, value, aggfunc = mean)
# A tibble: 1 x 5
   col1 col2      A     B     C
  <dbl> <chr> <dbl> <dbl> <dbl>
1     1 H         2     3   4.5

spread_with_multiple_values() shows that it is possible to combine an aggregate function with the standard input parameters of tidyr::spread(). Would it be an option to extend the functionality of spread in this direction and support an aggfunc parameter? Any input is welcome, but I'm certainly willing to add a PR with this adaptation if it would provide added value.

@hadley

Member

commented Jan 4, 2019

reshape::cast() used to do this and I deliberately moved away from it because it feels a bit outside the scope of data tidying, and into the realm of data aggregation. But I'll reconsider when thinking about spread()/gather() generally as part of #149.

@hadley

Member

commented Mar 3, 2019

Supporting an aggregation function in widening pivots is now trivial to implement, and I'll take a stab at it next week. If not supplied and the keys are non-unique, I think it makes the most sense to return a list column, as it's not clear that duplicating the other measures makes sense, in general.
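A pandas sketch (not tidyr itself) of the list-column default described here: when the same key has several values and no aggregation function is given, nest the duplicates into list cells instead of duplicating the other columns.

```python
import pandas as pd

# Key "a" appears twice; "b" once.
df = pd.DataFrame({"key": ["a", "a", "b"], "value": [2, 3, 5]})

# Collapse each key's values into a Python list, then transpose so keys
# become columns: each cell now holds a list, analogous to a list column.
wide = df.groupby("key")["value"].agg(list).to_frame().T
print(wide)
# the cell for "a" holds [2, 3]; the cell for "b" holds [5]
```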

@hadley hadley added this to the v1.0.0 milestone Mar 3, 2019
@hadley

Member

commented Mar 4, 2019

Alternatively, should this be a separate verb that simplifies list columns by applying an aggregation function? It might be useful in other cases, like where the list-cols come from JSON.

@yutannihilation

Member

commented Mar 5, 2019

Probably, we can consider list() to be the aggregation function for the case of list columns.

@hadley hadley closed this in 08a05d1 Mar 6, 2019
@hadley

Member

commented Mar 6, 2019

Implemented, but I don't love the argument name. @yutannihilation do you have any better ideas?

@yutannihilation

Member

commented Mar 7, 2019

How about something like duplicate, which takes a policy about what to do when there are duplicates:

  • "warn": warn and create a list column
  • "error": raise an error
  • aggregation function: aggregate the result by applying the function
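To make the proposed policy concrete, here is a hypothetical sketch (not tidyr, and the name `spread_with_policy` is invented for illustration) written in pandas terms: warn and nest into list cells, raise an error, or apply an aggregation function.

```python
import warnings
import pandas as pd

# Hypothetical illustration of the three-way "duplicate" policy above.
def spread_with_policy(df, key, value, duplicate="warn"):
    grouped = df.groupby(key)[value]
    if callable(duplicate):
        # An aggregation function: apply it to every group.
        out = grouped.agg(duplicate)
    elif (grouped.size() > 1).any():
        if duplicate == "error":
            raise ValueError("duplicate keys found")
        # "warn": emit a warning and nest duplicates into list cells.
        warnings.warn("duplicate keys found; creating list cells")
        out = grouped.agg(list)
    else:
        # No duplicates: plain values, unchanged type.
        out = grouped.first()
    return out.to_frame().T

df = pd.DataFrame({"key": ["a", "a", "b"], "value": [2, 3, 5]})
print(spread_with_policy(df, "key", "value", duplicate=min))
```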
@hadley

Member

commented Mar 7, 2019

Last night I realised that, for the sake of consistency, if you supply an aggregation function, it should always be applied regardless of whether there are duplicates or not. Otherwise, I think it's confusing for code like df %>% pivot_wide(values_collapse = length) to not compute lengths. I think that implies the name should be more about aggregation than duplicates or collapsing.

@yutannihilation

Member

commented Mar 7, 2019

Agreed. It makes sense.

In terms of consistency, I feel a bit uneasy that the column type can differ (the same type or a list-col) depending on whether there are duplicate keys or not. So I still think there should be an option to make it error when there's duplication, like separate()'s extra argument.

@hadley

Member

commented Mar 7, 2019

Hmmmm, that's an interesting point. I think my counter argument would be that the type of the value_to column will always vary based on the data (i.e. is the input a character or integer), so if you're worried then you should make some other assertion earlier. But maybe we should still have a keys_check argument that would allow you to check that each value has a unique identity?

@yutannihilation

Member

commented Mar 8, 2019

Yes, the spec itself doesn't know the column types, and the users should be sure about the data by themselves. But, still, from the viewpoint of the user, I want to expect integer columns if I spread an integer column, for example. I'm not sure, but my feeling that the number of values should not affect the type of the result is basically the same feeling we have about the existence of the drop option of data.frame's `[`.

So, I believe we need keys_check.

@hadley

Member

commented Mar 8, 2019

What if instead of keys_check we call it values_ptype? That would also allow us to solve both problems at once. (Assuming that we check at the end, after we've created the list-col if needed and the aggregation function has been applied.)

  • values_ptype = NULL: the default, throws a warning.
  • values_ptype = list(): I expect duplicate keys, and I'm happy to get a nested list. No warning.
  • values_ptype = integer(): I expect integer values. Errors if duplicated key.
@hadley hadley reopened this Mar 8, 2019
@yutannihilation

Member

commented Mar 9, 2019

Sounds good to me 👍

@hadley

Member

commented Mar 25, 2019

I implemented my idea using values_ptype, but I ended up not liking it, because it felt excessively fiddly. Instead I'm going to use values_fn = list(val = list) to always produce a list column, and for now have no way to throw an error if the combination of key values is unexpectedly non-unique.

That said, it just occurred to me that you could always use something like this:

just_one <- function(x) {
  if (vec_size(x) == 1) {
    return(x)
  }
  stop("Unexpectedly non-unique keys")
}
@hadley hadley closed this in a8ede91 Mar 25, 2019