Skip to content

Adding aggregation parameter to spread() to manage multiple values for the same key #474

Closed
@damianooldoni

Description

@damianooldoni

The tidyr function spread() returns an error if the same key (here a) has multiple values:

> df <- data.frame(key = c("a","a"), value = c(2,3))
> df
  key value
1   a     2
2   a     3
> spread(df, key, value)
Error: Duplicate identifiers for rows (1, 2)

In our project we needed to spread on multiple rows in order to get this:

  a
1 2
2 3

Spreading these duplicates to multiple rows can be seen as a special case of the more general problem of handling duplicate identifiers. For example in Python (in the pandas package) and in Microsoft Excel (under the name pivot table) duplicate identifiers are handled using an aggregation function, which is currently not supported in tidyr.

With this in mind, we suggest an alternative implementation of spread which supports all current parameters, but adds the aggfunc parameter to define how any multiple rows for the same key should be aggregated (e.g. average, sum, …). The default value NULL does not aggregate, but keeps the multiple rows. An implementation of this is provided at spread_with_multiple_values(). Examples:

1. Dataframe without multiple values for same key

library(trias)
df_1 <- data.frame(
  col1 = c(1, 1, 1, 1),
  col2 = c("H", "H", "H", "H"),
  key = c("A", "B", "C", "D"),
  value = c("R", "S", "T", "X"),
  stringsAsFactors = FALSE
)
spread_with_multiple_values(df_1, key, value)

Which produces the same result as tidyr::spread(df_1, key, value):

  col1 col2 A B C D
1    1    H R S T X

2. Dataframe with multiple values for same key + no aggfunc provided

library(trias)
df_2 <- data.frame(
  col1 = c(1, 1, 1, 1),
  col2 = c("H", "H", "H", "H"),
  key = c("A", "B", "C", "C"),
  value = c("R", "S", "T", "X"), # multiple values T, X for key C
  stringsAsFactors = FALSE
)
spread_with_multiple_values(df_2, key, value) # no “aggfunc” parameter

Which results in multiple rows for multiple values for the same key:

  col1 col2 A B C
1    1    H R S T
2    1    H R S X

While tidyr::spread(df_2, key, value) would throw an error.

Alternative ways to get the same result:

spread_with_multiple_values(df_2, 3, 4)
spread_with_multiple_values(df_2, -2, -1)
spread_with_multiple_values(df_2, "key", "value")

3. Dataframe with multiple values for same key + aggfunc provided

library(trias)
df_3 <- data.frame(
  col1 = c(1, 1, 1, 1),
  col2 = c("H", "H", "H", "H"),
  key = c("A", "B", "C", "C"),
  value = c(2, 3, 1, 8), # multiple values 1, 8 for key C
  stringsAsFactors = FALSE
)
spread_with_multiple_values(df_3, key, value, aggfunc = str_c, collapse = "-")
spread_with_multiple_values(df_3, key, value, aggfunc = min)
spread_with_multiple_values(df_3, key, value, aggfunc = mean)

Which results in:

> spread_with_multiple_values(df_3, key, value, aggfunc = str_c, collapse = "-")
# A tibble: 1 x 5
   col1 col2  A     B     C    
  <dbl> <chr> <chr> <chr> <chr>
1     1 H     2     3     1-8  
> spread_with_multiple_values(df_3, key, value, aggfunc = min)
# A tibble: 1 x 5
   col1 col2      A     B     C
  <dbl> <chr> <dbl> <dbl> <dbl>
1     1 H         2     3     1
> spread_with_multiple_values(df_3, key, value, aggfunc = mean)
# A tibble: 1 x 5
   col1 col2      A     B     C
  <dbl> <chr> <dbl> <dbl> <dbl>
1     1 H         2     3   4.5

spread_with_multiple_values() shows that it is possible to combine an aggregate function with the standard input parameters of tidyr::spread(). Would it be an option to extend the functionality of spread in this direction and support an aggfunc parameter? Any input is welcome, but I'm certainly willing to add a PR with this adaptation if it would provide added value.

Metadata

Metadata

Assignees

No one assigned

    Labels

    featurea feature request or enhancementpivoting ♻️pivot rectangular data to different "shapes"

    Type

    No type

    Projects

    No projects

    Milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions