Description
The tidyr
function spread()
returns an error if the same key (here a
) has multiple values:
> df <- data.frame(key = c("a","a"), value = c(2,3))
> df
key value
1 a 2
2 a 3
> spread(df, key, value)
Error: Duplicate identifiers for rows (1, 2)
In our project we needed to spread on multiple rows in order to get this:
a
1 2
2 3
Spreading these duplicates to multiple rows can be seen as a special case of the more general problem of handling duplicate identifiers. For example in Python (in the pandas
package) and in Microsoft Excel (under the name pivot table) duplicate identifiers are handled using an aggregation function, which is currently not supported in tidyr
.
With this in mind, we suggest an alternative implementation of spread
which supports all current parameters, but adds the aggfunc
parameter to define how any multiple rows for the same key should be aggregated (e.g. average, sum, …). The default value NULL
does not aggregate, but keeps the multiple rows. An implementation of this is provided at spread_with_multiple_values()
. Examples:
1. Dataframe without multiple values for same key
library(trias)
df_1 <- data.frame(
col1 = c(1, 1, 1, 1),
col2 = c("H", "H", "H", "H"),
key = c("A", "B", "C", "D"),
value = c("R", "S", "T", "X"),
stringsAsFactors = FALSE
)
spread_with_multiple_values(df_1, key, value)
Which produces the same result as tidyr::spread(df_1, key, value)
:
col1 col2 A B C D
1 1 H R S T X
2. Dataframe with multiple values for same key + no aggfunc provided
library(trias)
df_2 <- data.frame(
col1 = c(1, 1, 1, 1),
col2 = c("H", "H", "H", "H"),
key = c("A", "B", "C", "C"),
value = c("R", "S", "T", "X"), # multiple values T, X for key C
stringsAsFactors = FALSE
)
spread_with_multiple_values(df_2, key, value) # no “aggfunc” parameter
Which results in multiple rows for multiple values for the same key:
col1 col2 A B C
1 1 H R S T
2 1 H R S X
While tidyr::spread(df_2, key, value)
would throw an error.
Alternative ways to get the same result:
spread_with_multiple_values(df_2, 3, 4)
spread_with_multiple_values(df_2, -2, -1)
spread_with_multiple_values(df_2, "key", "value")
3. Dataframe with multiple values for same key + aggfunc provided
library(trias)
df_3 <- data.frame(
col1 = c(1, 1, 1, 1),
col2 = c("H", "H", "H", "H"),
key = c("A", "B", "C", "C"),
value = c(2, 3, 1, 8), # multiple values 1, 8 for key C
stringsAsFactors = FALSE
)
spread_with_multiple_values(df_3, key, value, aggfunc = str_c, collapse = "-")
spread_with_multiple_values(df_3, key, value, aggfunc = min)
spread_with_multiple_values(df_3, key, value, aggfunc = mean)
Which results in:
> spread_with_multiple_values(df_3, key, value, aggfunc = str_c, collapse = "-")
# A tibble: 1 x 5
col1 col2 A B C
<dbl> <chr> <chr> <chr> <chr>
1 1 H 2 3 1-8
> spread_with_multiple_values(df_3, key, value, aggfunc = min)
# A tibble: 1 x 5
col1 col2 A B C
<dbl> <chr> <dbl> <dbl> <dbl>
1 1 H 2 3 1
> spread_with_multiple_values(df_3, key, value, aggfunc = mean)
# A tibble: 1 x 5
col1 col2 A B C
<dbl> <chr> <dbl> <dbl> <dbl>
1 1 H 2 3 4.5
spread_with_multiple_values()
shows that it is possible to combine an aggregate function with the standard input parameters of tidyr::spread()
. Would it be an option to extend the functionality of spread
in this direction and support an aggfunc
parameter? Any input is welcome, but I'm certainly willing to add a PR with this adaptation if it would provide added value.