
Adding aggregation parameter to spread() to manage multiple values for the same key #474

Closed
damianooldoni opened this issue Jul 6, 2018 · 13 comments

Comments

@damianooldoni
commented Jul 6, 2018

The tidyr function spread() returns an error if the same key (here a) has multiple values:

> df <- data.frame(key = c("a","a"), value = c(2,3))
> df
  key value
1   a     2
2   a     3
> spread(df, key, value)
Error: Duplicate identifiers for rows (1, 2)

In our project we needed to spread on multiple rows in order to get this:

  a
1 2
2 3

Spreading these duplicates to multiple rows can be seen as a special case of the more general problem of handling duplicate identifiers. For example, in Python (the pandas package) and in Microsoft Excel (as pivot tables), duplicate identifiers are handled with an aggregation function, which tidyr currently does not support.
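For reference, here is a minimal sketch of the pandas behaviour mentioned above: `pivot_table()` resolves duplicate identifiers by applying an aggregation function (the mean, made explicit here) rather than raising an error, using the same two-row data frame as the tidyr example.

```python
import pandas as pd

# Same data as the tidyr example: two values (2, 3) for the same key "a".
df = pd.DataFrame({"key": ["a", "a"], "value": [2, 3]})

# pivot_table aggregates duplicates instead of erroring; with aggfunc="mean"
# the duplicate values 2 and 3 collapse to 2.5 in the widened column "a".
wide = df.pivot_table(columns="key", values="value", aggfunc="mean")
print(wide)
```

This is the kind of behaviour an `aggfunc` parameter would bring to `spread()`.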

With this in mind, we suggest an alternative implementation of spread which supports all current parameters but adds an aggfunc parameter to define how multiple rows for the same key should be aggregated (e.g. average, sum, …). The default value NULL does not aggregate but keeps the multiple rows. An implementation of this is provided at spread_with_multiple_values(). Examples:

1. Dataframe without multiple values for same key

library(trias)
df_1 <- data.frame(
  col1 = c(1, 1, 1, 1),
  col2 = c("H", "H", "H", "H"),
  key = c("A", "B", "C", "D"),
  value = c("R", "S", "T", "X"),
  stringsAsFactors = FALSE
)
spread_with_multiple_values(df_1, key, value)

Which produces the same result as tidyr::spread(df_1, key, value):

  col1 col2 A B C D
1    1    H R S T X

2. Dataframe with multiple values for same key + no aggfunc provided

library(trias)
df_2 <- data.frame(
  col1 = c(1, 1, 1, 1),
  col2 = c("H", "H", "H", "H"),
  key = c("A", "B", "C", "C"),
  value = c("R", "S", "T", "X"), # multiple values T, X for key C
  stringsAsFactors = FALSE
)
spread_with_multiple_values(df_2, key, value) # no “aggfunc” parameter

Which results in multiple rows for multiple values for the same key:

  col1 col2 A B C
1    1    H R S T
2    1    H R S X

While tidyr::spread(df_2, key, value) would throw an error.

Alternative ways to get the same result:

spread_with_multiple_values(df_2, 3, 4)
spread_with_multiple_values(df_2, -2, -1)
spread_with_multiple_values(df_2, "key", "value")

3. Dataframe with multiple values for same key + aggfunc provided

library(trias)
df_3 <- data.frame(
  col1 = c(1, 1, 1, 1),
  col2 = c("H", "H", "H", "H"),
  key = c("A", "B", "C", "C"),
  value = c(2, 3, 1, 8), # multiple values 1, 8 for key C
  stringsAsFactors = FALSE
)
spread_with_multiple_values(df_3, key, value, aggfunc = str_c, collapse = "-")
spread_with_multiple_values(df_3, key, value, aggfunc = min)
spread_with_multiple_values(df_3, key, value, aggfunc = mean)

Which results in:

> spread_with_multiple_values(df_3, key, value, aggfunc = str_c, collapse = "-")
# A tibble: 1 x 5
   col1 col2  A     B     C    
  <dbl> <chr> <chr> <chr> <chr>
1     1 H     2     3     1-8  
> spread_with_multiple_values(df_3, key, value, aggfunc = min)
# A tibble: 1 x 5
   col1 col2      A     B     C
  <dbl> <chr> <dbl> <dbl> <dbl>
1     1 H         2     3     1
> spread_with_multiple_values(df_3, key, value, aggfunc = mean)
# A tibble: 1 x 5
   col1 col2      A     B     C
  <dbl> <chr> <dbl> <dbl> <dbl>
1     1 H         2     3   4.5

spread_with_multiple_values() shows that it is possible to combine an aggregate function with the standard input parameters of tidyr::spread(). Would it be an option to extend the functionality of spread in this direction and support an aggfunc parameter? Any input is welcome, but I'm certainly willing to add a PR with this adaptation if it would provide added value.

@hadley

Member

commented Jan 4, 2019

reshape::cast() used to do this and I deliberately moved away from it because it feels a bit outside the scope of data tidying, and into the realm of data aggregation. But I'll reconsider when thinking about spread()/gather() generally as part of #149.

@hadley

Member

commented Mar 3, 2019

Supporting an aggregation function in widening pivots is now trivial to implement, and I'll take a stab at it next week. If not supplied and the keys are non-unique, I think it makes the most sense to return a list column, as it's not clear that duplicating the other measures makes sense, in general.
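A pandas sketch (not tidyr itself) of the list-column default described here: when the same key has several values and no aggregation function is given, nest the duplicates into list cells instead of duplicating the other columns.

```python
import pandas as pd

# Key "a" appears twice; "b" once.
df = pd.DataFrame({"key": ["a", "a", "b"], "value": [2, 3, 5]})

# Collapse each key's values into a Python list, then transpose so keys
# become columns: each cell now holds a list, analogous to a list column.
wide = df.groupby("key")["value"].agg(list).to_frame().T
print(wide)
# the cell for "a" holds [2, 3]; the cell for "b" holds [5]
```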

@hadley hadley added this to the v1.0.0 milestone Mar 3, 2019
@hadley

Member

commented Mar 4, 2019

Alternatively, should this be a separate verb that simplifies list columns by applying an aggregation function? It might be useful in other cases, like where the list-cols come from JSON.

@yutannihilation

Member

commented Mar 5, 2019

Probably, we can consider list() to be the aggregation function for the case of list columns.

@hadley hadley closed this in 08a05d1 Mar 6, 2019
@hadley

Member

commented Mar 6, 2019

Implemented, but I don't love the argument name. @yutannihilation do you have any better ideas?

@yutannihilation

Member

commented Mar 7, 2019

How about something like duplicate, which takes a policy about what to do when there are duplicates:

  • "warn": warn and create a list column
  • "error": raise an error
  • aggregation function: aggregate the result by applying the function
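To make the proposed policy concrete, here is a hypothetical sketch (not tidyr, and the name `spread_with_policy` is invented for illustration) written in pandas terms: warn and nest into list cells, raise an error, or apply an aggregation function.

```python
import warnings
import pandas as pd

# Hypothetical illustration of the three-way "duplicate" policy above.
def spread_with_policy(df, key, value, duplicate="warn"):
    grouped = df.groupby(key)[value]
    if callable(duplicate):
        # An aggregation function: apply it to every group.
        out = grouped.agg(duplicate)
    elif (grouped.size() > 1).any():
        if duplicate == "error":
            raise ValueError("duplicate keys found")
        # "warn": emit a warning and nest duplicates into list cells.
        warnings.warn("duplicate keys found; creating list cells")
        out = grouped.agg(list)
    else:
        # No duplicates: plain values, unchanged type.
        out = grouped.first()
    return out.to_frame().T

df = pd.DataFrame({"key": ["a", "a", "b"], "value": [2, 3, 5]})
print(spread_with_policy(df, "key", "value", duplicate=min))
```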
@hadley

Member

commented Mar 7, 2019

Last night I realised that, for the sake of consistency, if you supply an aggregation function, it should always be applied regardless of whether there are duplicates or not. Otherwise, I think it's confusing for code like df %>% pivot_wide(values_collapse = length) to not compute lengths. I think that implies the name should be more about aggregation than duplicates or collapsing.

@yutannihilation

Member

commented Mar 7, 2019

Agreed. It makes sense.

In terms of consistency, I feel a bit uneasy that the column type can differ (the same type or a list-col) depending on whether there are duplicate keys or not. So I still think there should be an option to make it error when there's duplication, like separate()'s extra argument.

@hadley

Member

commented Mar 7, 2019

Hmmmm, that's an interesting point. I think my counter argument would be that the type of the value_to column will always vary based on the data (i.e. is the input a character or integer), so if you're worried then you should make some other assertion earlier. But maybe we should still have a keys_check argument that would allow you to check that each value has a unique identity?

@yutannihilation

Member

commented Mar 8, 2019

Yes, the spec itself doesn't know the column types, and the users should be sure about the data by themselves. But, still, from the viewpoint of the user, I want to expect integer columns if I spread an integer column, for example. I'm not sure, but my feeling that the number of values should not affect the type of the result is basically the same feeling we have about the existence of the drop option of data.frame's `[`.

So, I believe we need keys_check.

@hadley

Member

commented Mar 8, 2019

What if instead of keys_check we call it values_ptype? That would also allow us to solve both problems at once. (Assuming that we check at the end, after we've created the list-col if needed and the aggregation function has been applied.)

  • values_ptype = NULL: the default, throws a warning.
  • values_ptype = list(): I expect duplicate keys, and I'm happy to get a nested list. No warning.
  • values_ptype = integer(): I expect integer values. Errors if duplicated key.
@hadley hadley reopened this Mar 8, 2019
@yutannihilation

Member

commented Mar 9, 2019

Sounds good to me 👍

@hadley

Member

commented Mar 25, 2019

I implemented my idea using values_ptype, but I ended up not liking it, because it felt excessively fiddly. Instead I'm going to use values_fn = list(val = list) to always produce a list column, and for now have no way to throw an error if the combination of key values is unexpectedly non-unique.

That said, it just occurred to me that you could always use something like this:

just_one <- function(x) {
  if (vec_size(x) == 1) {
    return(x)
  }
  stop("Unexpectedly non-unique keys")
}
@hadley hadley closed this in a8ede91 Mar 25, 2019